"P" document, pertaining to a document published prior to the international filing date but later
than the priority date claimed, with respect to claims 1-8 of the foreign application. The foreign
Examiner indicated that page 423, paragraph 6.2.1 and page 424, paragraph 6.2.2 are relevant;

An article by A. KNOLL, entitled "MSDL: MPEG-4-SYSTEMS AND
DESCRIPTION LANGUAGES", at pages 459, 460, 462, 464 and 465 of FERNSEH UND
KINOTECHNIK, vol. 50, No. 8/09, August 1996, which was cited by the Examiner as being a
category "P" and "A" document, with respect to claims 1-8 of the foreign application. The
Examiner indicated that paragraph 9 is relevant;

European Patent EP 0 648 056 to Thomson Consumer Electronics, published on 
April 12, 1995, was cited by the foreign Examiner as being a category "A" document with
respect to claims 1-8 of the foreign application. The Examiner indicated that the Abstract and 
Fig. 1 are relevant. Applicants note that U.S. Patents 5,430,485 and 5,467,139 are counterpart 
documents; and 

European Patent EP 0 634 870 A2 to Pioneer Electronic Corp., published on January 18, 
1995, was cited by the foreign Examiner as being a category "A" document with respect to 
claims 1-8 of the foreign application. The Examiner indicated that the Abstract; column 10, lines
44-48; column 11, lines 31-38; and Fig. 9 are relevant. Applicants note that U.S. Patent
5,537,409 is a counterpart document.



The following documents were cited by the Examiner in U.S. Application No. 09/011,761
U.S. Patent 5,784,422 to HEERMANN, which issued in July, 1998.
MSDL: MPEG-4 Systems and Description Languages

The structure and goals of MPEG-4 SDL, as they stand at present (May/June 1996), are presented and
explained. As the core of MPEG-4, MSDL must support the required functionalities. The system concepts
must furthermore be flexible and offer possibilities for extensions, also with regard to future coding
methods. Content-based interactivity and scalability are supported essentially by coding and transmitting
the audiovisual objects separately and assembling them into a scene only in the terminal.
MSDL is presented and described in object-oriented terminology. Audiovisual objects are transmitted,
and, in keeping with the object-oriented terminology, the program structure of the terminal is
represented by a class hierarchy.



1. Introduction

Dr. Angelika Knoll has worked at the Research and Technology Center of Telekom in Darmstadt since
1988. Her main field of work is MPEG-4. She is a member of several national and international groups.



In the existing standards for transmitting audiovisual data, the audiovisual information, i.e. the
picture and sound material, is first prepared according to a script (if one exists), that is, mixed and
provided with special effects. The finished picture and sound material is then coded with a
standardized coding method (PAL, MPEG, H.263, ...), transmitted, and decoded and presented by the
receiver. The respective coding algorithm is tailored to the application.

To enable content-based interactivity and increased flexibility, MPEG-4 instead codes individual
parts of an image or a scene separately as audiovisual objects. Coded audiovisual objects, information
about the algorithm to be used for decoding each of them, and the spatio-temporal relationships between
the various audiovisual objects of the scene are combined, or multiplexed, into one data stream. The
spatial and temporal relationship required to build the scene in the terminal is defined by a script.
After transmission and decoding, the individual A/V objects are mixed into a scene in the decoder with
the help of the script and provided with special effects (Fig. 1).



2. Audiovisual Objects

The term "audiovisual object" is a central concept in MSDL. Audiovisual objects (A/V objects) are
transmitted in their coded form. Such an object can be audible and/or visible, two- or
three-dimensional, static or time-varying; it can be of natural or synthetic origin, and it can be a
combination of these. It can also be composed of other A/V objects.

Whereas the coding algorithm used has so far been tailored to the application, it can now be adapted
to the audiovisual object, so that in principle every audiovisual object can be coded with its own
algorithm. For example, it makes no sense to code synthetic objects that are represented in the
computer by means of the "Virtual Reality Modeling Language" (VRML) with an MPEG-2 algorithm.

In the terminal, the audiovisual scenes are built from the individual two- or three-dimensional
audiovisual objects as specified by a script. The scene and the individual objects can be manipulated
interactively. Each object generally has its own spatial and temporal resolution and its own spatial
and temporal coordinate system. When the scene is built, the object's own coordinate system must be
transformed into the global coordinate system of the scene. Some objects must be displayed in real
time, as in TV and communication applications; others can be animated by the user.

Fig. 1. Preparation of the picture material: (a) in the existing standards, (b) in MPEG-4.

Fig. 2 (top). Example scene consisting of A/V objects.

Fig. 3 (left). Example of a hierarchy of A/V objects (labels: voice, image, desk, globe, screen,
MPEG-1 sequence).

This is explained below with an example. The scene shown in Fig. 2 is built from a number of
different A/V objects: a two-dimensional white background as a synthetic, i.e. computer-generated,
object; the person (consisting of the natural sub-objects picture and sound, i.e. recorded with camera
and microphone); the desk and the globe as three-dimensional synthetic objects that have been described
in the computer by means of the "Virtual Reality Modeling Language" (VRML); and the presentation,
consisting of a two-dimensional synthetic screen and a natural (MPEG-1-coded) sequence. The entire
scene is then a three-dimensional object. Each sub-object of the scene has its own time dependence:
the sequence is a succession of images, the person moves and speaks, and the globe rotates. The screen
and the desk are static.
Since A/V objects can be composed of other A/V objects, they form a hierarchical structure, as shown
in Fig. 3.
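
The hierarchy of the example scene can be sketched as a small tree. This is only an illustration in Python and not part of MSDL: the class name `AVObject`, its fields and the method `leaves` are hypothetical, and the `static` flag of the background is an assumption (the text names only the screen and the desk as static).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AVObject:
    """Hypothetical A/V object node: a name, a kind, and optional sub-objects."""
    name: str
    kind: str = "composite"   # "natural", "synthetic" or "composite"
    static: bool = True       # False if the object changes over time
    children: List["AVObject"] = field(default_factory=list)

    def leaves(self) -> List["AVObject"]:
        """Return all objects without sub-objects, depth first."""
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


# The example scene from the text, expressed as a tree.
scene = AVObject("scene", static=False, children=[
    AVObject("background", "synthetic"),
    AVObject("person", static=False, children=[
        AVObject("picture", "natural", static=False),
        AVObject("sound", "natural", static=False),
    ]),
    AVObject("desk", "synthetic"),
    AVObject("globe", "synthetic", static=False),
    AVObject("presentation", static=False, children=[
        AVObject("screen", "synthetic"),
        AVObject("MPEG-1 sequence", "natural", static=False),
    ]),
])

leaf_names = [obj.name for obj in scene.leaves()]
static_names = [obj.name for obj in scene.leaves() if obj.static]
```

Walking the tree depth first yields the objects that actually have to be coded and transmitted; the composite nodes only group them.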

Special video objects are the so-called "Video Object Planes" (VOPs): two-dimensional image sections
whose texture and contour are transmitted and which are embedded in rectangles. These VOPs are treated
in detail in the article on image coding. If all video objects are two-dimensional, the entire scene
is also only two-dimensional.



3. Tasks and Flexibility of MSDL

Since audiovisual objects are coded and transmitted separately, content-based interactivity and
universal access to the individual A/V objects become possible in the MPEG-4 terminal. MSDL must
essentially provide the following:

• The composition of the scene from the individual audiovisual objects and the support of
content-based interactivity.

• The description of the coded information of these audiovisual objects (syntax).

• Multiplexing and synchronization of the substreams that are multiplexed for transmission.

Flexibility and extensibility are a further essential point. To support them, three flexibility
levels of the decoder are defined in MSDL:

• Level 0 (no decoder programmability): Only standardized algorithms or profiles can be used, or one
of the standardized class hierarchies can be selected.

• Level 1 (flexibility): A set of standardized tools can be used to configure the decoder in a
flexible way, or predefined generic classes can be specified, again by means of a script, and a class
hierarchy can be built from them.

• Level 2 (extensibility): New tools, program parts or algorithms can be transmitted and thus loaded
into the decoder.

In addition, the Requirements group specifies so-called conformance points, each a specific
combination of flexibility (set of tools and algorithms), profiles (group of functionalities) and
levels (specification of constraints).
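
As a rough illustration of the three levels, the following Python sketch checks whether a decoder of a given level could handle a stream. The names `FlexLevel` and `can_handle` are hypothetical, and the sketch assumes that each level includes the capabilities of the levels below it, which the text suggests but does not state explicitly.

```python
from enum import IntEnum


class FlexLevel(IntEnum):
    """The three decoder flexibility levels named in the text."""
    FIXED = 0        # only standardized algorithms or profiles
    FLEXIBLE = 1     # standardized tools configured by a script
    EXTENSIBLE = 2   # new tools or algorithms can be downloaded


def can_handle(decoder_level: FlexLevel, needs_script_config: bool,
               needs_download: bool) -> bool:
    """Hypothetical check: can a decoder at this level handle a stream?"""
    if needs_download:
        return decoder_level >= FlexLevel.EXTENSIBLE
    if needs_script_config:
        return decoder_level >= FlexLevel.FLEXIBLE
    return True
```

A conformance point would add profile and level constraints on top of this check; they are left out here because the text does not define them further.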

To accomplish all these tasks, MSDL has the following sub-aspects:

• MSDL-A: The description of a global architecture of an MPEG-4 system.

• MSDL-O: The definition of classes and objects (i.e. program structure and predefined A/V objects)
of the MPEG-4 terminal.

• MSDL-M: For the transmission of the coded information, the structure of the bitstream must be fixed
by defining the multiplex scheme.

• MSDL-S: The definition of a description language to specify the syntax of the coded information.

• MSDL-R: The definition (not to be standardized) of a readable form, i.e. a high-level language for
convenient programming.

• MSDL-B: The description languages must be translated into a binary equivalent for transmission.



4. Architecture of MSDL (MSDL-A)

In MPEG-1 and -2, the decoder and the decoding algorithm were specified in addition to the bitstream
syntax, in order to define unambiguously what is to be done with the transmitted data and how they are
to be processed. MPEG-4 is analogous: here too, the decoder and its software must be specified in order
to decode audiovisual objects, compose them into a scene and present them. This is done by defining the
so-called "MPEG-4 objects" (see below). Since software parts will also be downloadable (flexibility
levels 1 and 2), the interfaces between the individual software parts must also be defined (API,
Application Programming Interface).

This part presents the concept of the decoder and also gives an overview of the overall system, the
scene composition and the description of the coordinate transformations required for scene composition.

5. Overview of the Overall System (MSDL-A)



The overall system is shown in Fig. 4. The individual A/V objects are coded in an object-specific
way. Before the respective A/V objects can be transmitted, however, the decoder must be configured
accordingly. This means that the decoder must be told what the decoding algorithm, or the class
hierarchy for decoding and presenting the object, has to look like. Modules or classes that are not yet
available are requested via the back channel, if one exists. Depending on the flexibility level, this
configuration and learning phase is more or less complex. Then the data of the coded A/V object are
transmitted. Via the back channel, if one exists, further A/V objects can also be requested
interactively.
A typical MPEG-4 application thus has three phases, which can occur at any time, i.e. not only
during connection setup:

• Configuration phase: The configuration parameters are exchanged, for example the description of the
flexibility level of the decoder, the hardware capabilities of coder and decoder, and so on.

• Learning phase: Classes or program code are transmitted from the coder to the decoder.

• Transmission phase: The actual coded information of the A/V objects is transmitted.

The individual A/V objects are then decoded in the terminal and, with the help of the likewise
transmitted script, which describes the spatial and temporal interplay of the individual A/V objects,
composed into a scene of the corresponding dimension. The entire scene or parts of it can then be
projected onto the screen.
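
The interplay of the three phases can be sketched as follows. This Python fragment is a simplified illustration, not a description of an actual MPEG-4 terminal: the class `Terminal`, its method `receive_object` and the class names passed to it are hypothetical, and running one configuration exchange per object is an assumption made to keep the example short.

```python
from enum import Enum


class Phase(Enum):
    CONFIGURATION = "configuration"
    LEARNING = "learning"
    TRANSMISSION = "transmission"


class Terminal:
    """Hypothetical terminal that tracks which decoding classes it holds."""

    def __init__(self, known_classes):
        self.known_classes = set(known_classes)
        self.log = []

    def receive_object(self, name, required_class):
        """Run the phases needed before and during transfer of one object."""
        # Configuration: coder and decoder exchange parameters.
        self.log.append((Phase.CONFIGURATION, name))
        # Learning: only needed if the decoding class is missing.
        if required_class not in self.known_classes:
            self.log.append((Phase.LEARNING, required_class))
            self.known_classes.add(required_class)
        # Transmission: the coded data of the object itself.
        self.log.append((Phase.TRANSMISSION, name))


terminal = Terminal(known_classes={"MPEG-1"})
terminal.receive_object("sequence", "MPEG-1")
terminal.receive_object("globe", "VRML")
terminal.receive_object("desk", "VRML")
phases = [phase for phase, _ in terminal.log]
```

The learning phase appears only once, for the first object that needs a class the terminal does not yet hold, which matches the remark that the phases can occur at any time during a session.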



6. Local and Global Coordinate Systems, Scene Composition (MSDL-A)

As already emphasized several times, each A/V object can be coded separately. It is represented in
its own spatial and temporal (3D+t) coordinate system. Building the entire scene then requires
appropriate coordinate transformations from the respective object coordinate systems to the global
coordinate system of the scene (Eq. 1). The transformation can simply be a translation (Eq. 2); in
more complex cases it can also include rotation and zoom or other operations:

$(x, y, z, t)_{\mathrm{object}} \rightarrow (x, y, z, t)_{\mathrm{scene}}$  (1)

In the simple case of a purely spatial translation, one then has

$(x, y, z, t)_{\mathrm{scene}} = (x, y, z, t)_{\mathrm{object}} + (\Delta x, \Delta y, \Delta z, 0)$  (2)
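
The purely spatial translation of Eq. (2) amounts to adding the displacement to the three spatial coordinates and leaving the time coordinate unchanged. A minimal Python sketch, in which the function name `to_scene` is hypothetical:

```python
def to_scene(point, shift):
    """Map an object-space point (x, y, z, t) into scene space by translation.

    `shift` is (dx, dy, dz); the time coordinate is left unchanged,
    which corresponds to the zero in the fourth component of Eq. (2).
    """
    x, y, z, t = point
    dx, dy, dz = shift
    return (x + dx, y + dy, z + dz, t)
```

Rotation or zoom would replace the simple addition by a more general transformation; those cases are not covered by this sketch.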

If all A/V objects are the VOPs mentioned above, they must be laid on top of one another in the
correct order and at the correct position (given by the script mentioned above, which contains the
displacements (Δx, Δy, Δz) and the layering). In Fig. 5 the VOPs are indicated in green, the entire
scene is outlined in red, the section displayed on the screen is black, and the displacement vectors
(Δx, Δy, Δz) for the translation of the VOPs are blue.

For the scene shown in Fig. 2, the scene composition is of course more complex. The resulting scene
is projected onto a plane for viewing on the screen (Fig. 6).
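
Stacking VOPs in the right order and at the right position can be sketched with a tiny canvas. The Python example below is an illustration under simplifying assumptions: each VOP is treated as a filled rectangle (real VOPs carry a contour inside their rectangle), and the function `compose` and the dictionary keys are hypothetical names.

```python
def compose(width, height, vops):
    """Stack rectangular VOPs onto a canvas; higher layers cover lower ones.

    Each VOP is a dict with the keys 'label', 'w', 'h', 'dx', 'dy' and
    'layer'. Returns a grid of labels, with None where nothing is drawn.
    """
    canvas = [[None] * width for _ in range(height)]
    # Draw from the lowest layer upward so later VOPs overwrite earlier ones.
    for vop in sorted(vops, key=lambda v: v["layer"]):
        for row in range(vop["h"]):
            for col in range(vop["w"]):
                x, y = col + vop["dx"], row + vop["dy"]
                if 0 <= x < width and 0 <= y < height:
                    canvas[y][x] = vop["label"]
    return canvas


vops = [
    {"label": "B", "w": 2, "h": 2, "dx": 1, "dy": 1, "layer": 1},
    {"label": "A", "w": 2, "h": 2, "dx": 0, "dy": 0, "layer": 0},
]
canvas = compose(4, 3, vops)
```

Where the two rectangles overlap, the VOP on the higher layer wins regardless of the order in which the VOPs were listed.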



Fig. 4. Overall system.

Fig. 5. Scene composition from VOPs (labels: scene, displacement vector (Δx, Δy, Δz)).

Fig. 6 (right). Projection of the scene onto the image plane (label: hypothetical viewer).

Fig. 7 (bottom). Part of the class hierarchy of the decoder.


7. Fundamentals of "MPEG-4 Objects" (MSDL-O)

In object-oriented parlance, data structures and the program code, or the methods, required to
process these data structures are combined into a "class". A class is an abstract construct; for
example, no values have yet been assigned to the data. An object is a specific realization of a class:
at least some of the data have been assigned a value and the corresponding memory has been reserved.
One says that the object is an instance of the class.

The instantiated classes, or objects, required in the MPEG-4 decoder are the "MPEG-4 objects". The
classes generally form a hierarchical structure; a derived class is a specialization of the class above
it. A derived class can access the methods and data structures of the class above it; one says that
data structures and methods are inherited.

During the configuration and learning phases, the required program structure is built up in the
decoder. For this purpose, the corresponding classes are instantiated, or the corresponding software
is loaded. The individual parts are essentially characterized by their interfaces (input and output
methods) and by the program core, which performs the actual processing of the data. Since it must be
possible to download individual tools and individual classes, the interfaces between the classes and
modules must be defined in detail, whereas for the core it only has to be ensured that the correct
results are produced.
Part of the hierarchical class structure is shown in Fig. 7.

[Fig. 7 node labels: MPEG-4 object, A/V object, VOP object, transformation, composition, projection,
input/output stream]

For example, the class "MPEG-2 object" contains, among other things, the data structures for coded
image data and decoded pictures, and at least two methods, namely one for decoding and one for
presentation. Presumably a further method, still to be defined, will handle the interaction with the
user. This class is a specialization of the more general class "A/V object", which in turn is a
specialization of the class "MPEG-4 object".
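
The specialization chain described above can be written down directly as an inheritance hierarchy. The following Python sketch is illustrative only: the class and attribute names are hypothetical, and the body of `decode` is a placeholder that merely splits the input into bytes instead of running an MPEG-2 algorithm.

```python
class MPEG4Object:
    """Root of the hypothetical class hierarchy."""


class AVObject(MPEG4Object):
    """General audiovisual object with the two methods named in the text."""

    def decode(self):
        raise NotImplementedError

    def present(self):
        raise NotImplementedError


class MPEG2Object(AVObject):
    """Specialization holding coded image data and decoded pictures."""

    def __init__(self, coded_data: bytes):
        self.coded_data = coded_data
        self.decoded_pictures = []

    def decode(self):
        # Placeholder: a real decoder would run the MPEG-2 algorithm here.
        self.decoded_pictures = [bytes([b]) for b in self.coded_data]

    def present(self):
        # Placeholder presentation: report how many pictures are available.
        return len(self.decoded_pictures)


obj = MPEG2Object(b"\x01\x02\x03")
obj.decode()
shown = obj.present()
```

Because only the interface (`decode`, `present`) is fixed in the base class, a downloaded class could replace `MPEG2Object` without changes to the code that calls it, which is the point the text makes about defining interfaces in detail.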

In addition to the class hierarchy, which has already been defined, further so-called "MPEG-4
objects" that are to become part of the standard must be specified. These "MPEG-4 objects" have not
yet been specified in detail. They will include, among others:

• Classes or program modules:

> Tools such as

- discrete cosine transform

- entropy coding

- motion compensation

- wavelet decomposition

> Algorithms such as

- H.261

- H.263

- MPEG-1

- MPEG-2

> Profiles such as

- Main Profile @ Main Level (MP@ML) of MPEG-2

> Modules for transformation

> Modules for scene projection

• A/V objects such as

> simple building blocks for graphics such as

- rectangle

- circle

> simple synthetic images and sounds

• Character fonts

• "Buttons" for clicking

8. Syntax Description of Audiovisual Objects (MSDL-S)

In MPEG-1 and -2, a language similar to the programming language C was used to describe the bitstream
syntax. The bitstream syntax is known equally to the coder and the decoder. The syntax specifies, for
example, with how many bits a value is transmitted and whether it is, say, an unsigned integer. This
syntax description is now carried further into object-oriented representations. It can also be
transmitted, in its corresponding binary equivalent, so that it need not be known to the decoder a
priori. The syntax of the coded information can thus be tailored to the application.

To decode an A/V object contained in the bitstream, a corresponding class must be present in the
decoder that applies two methods in succession: "parsing" and the actual "decoding". In the first
method, the data arriving in the bitstream are read and converted into the internal format of the
decoder, and the corresponding data structures of the class are filled; the second method performs the
actual decoding. The bitstream syntax is thus separated from the actual decoding algorithm.
Transmitting the syntax description of the bitstream in the form of a script configures the "parsing"
method. In this syntactic description, the data types occurring in the data stream are defined, and
the control of the data flow is specified by simple logical conditions and arithmetic statements.
The definition of this script language is already very far advanced. The syntax of coded information
in the existing standards can be described very well with it.
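
The idea that a transmitted syntax description configures the parsing method can be illustrated with a table-driven parser. This Python sketch is not the MSDL-S language: the syntax description is reduced to a list of (name, bit width) pairs, conditions and arithmetic are left out, and the field names and widths are invented for the example.

```python
class BitReader:
    """Read unsigned integers of arbitrary bit width from a byte string."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, nbits: int) -> int:
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1  # most significant bit first
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse(data: bytes, syntax):
    """Fill a dict of fields according to a (name, bit width) description."""
    reader = BitReader(data)
    return {name: reader.read(nbits) for name, nbits in syntax}


# Hypothetical syntax description; field names and widths are invented.
syntax = [("width", 12), ("height", 12), ("frame_rate_code", 4), ("flag", 1)]
fields = parse(bytes([0x16, 0x01, 0x20, 0x38]), syntax)
```

The parser itself contains no knowledge of the fields; changing the `syntax` list changes what is read, while the decoding step that would consume `fields` stays untouched.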



9. Multiplex (MSDL-M)

The different elementary data streams, such as coded A/V objects and scripts, must be combined for
transmission into one data stream, or into several data streams if more than one channel is used.
Solutions still have to be worked out for the following points:

• One focus is synchronization. The time base of the decoder must be reconciled with the respective
time base of the A/V object (or of the coder of this object).

• In addition, different A/V objects must also be synchronized with one another (for example picture
and sound).

• The specification for the provision and handling of the decoder buffer is a further task area for
the multiplex.

The goal is to achieve the highest possible flexibility of the multiplex structures in order to cover
all ranges of data rates. For example, the packet length, or at least the structure of a packet, must
be adaptable to the data rate. A multiplex structure such as the one defined for MPEG-2 is not suitable
for applications with low data rates; it would entail excessive delays. The amount of overhead must
also be configurable according to the data rate, or be very small, in order to permit transmission over
narrowband channels as well.
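
The delay argument can be made concrete with a short calculation. The Python sketch below uses the MPEG-2 transport stream packet size (188 bytes, of which 4 bytes are header); the two source rates of 8 kbit/s and 4 Mbit/s are illustrative assumptions, not figures from the article, and the function names are hypothetical.

```python
def packet_fill_delay(payload_bytes: int, rate_bit_per_s: float) -> float:
    """Seconds needed to fill one packet payload at a constant source rate."""
    return payload_bytes * 8 / rate_bit_per_s


def overhead_fraction(header_bytes: int, payload_bytes: int) -> float:
    """Share of the transmitted bits spent on the packet header."""
    return header_bytes / (header_bytes + payload_bytes)


# MPEG-2 transport stream packet: 188 bytes, of which 4 are header.
delay_low = packet_fill_delay(184, 8_000)        # assumed 8 kbit/s source
delay_high = packet_fill_delay(184, 4_000_000)   # assumed 4 Mbit/s source
share_long = overhead_fraction(4, 184)
share_short = overhead_fraction(4, 20)           # same header, short payload
```

At the low rate, a single source needs roughly 0.18 s just to fill one packet, while at the high rate the same packet fills in well under a millisecond. Shortening the payload reduces the delay but raises the header share, which is why both packet length and overhead need to be configurable.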

So far, essentially only the requirements have been worked out. Since the future applications of
MPEG-4 are not yet defined, it is fairly certain that flexibility and programmability will also be
required at the multiplex level.



10. Transmission of "Tools" and Classes (MSDL-R and MSDL-B)

The extensibility of an MPEG-4 terminal is made possible by flexibility level 2. New tools or new
classes can be transmitted. The decoder is then fully configurable within the limits of its hardware
capabilities. This part has not yet been worked on.
The tools or classes are written in a high-level language such as C++ or Java; they must be available
to the programmer in readable form (MSDL-R). Transmission takes place in binary form; the simplest
option would be based on the ASCII representation (MSDL-B).

To avoid having to keep compilers available in the terminal, MSDL-B will probably use the code of a
"virtual machine" that is still to be defined. The definition of this virtual machine has, however,
been moved from the MSDL group to the implementation group. The virtual machine will presumably build
on so-called Java bytecode, which, as an assembler-like code, is machine-independent. The binary form
of the description will then probably use this code as well.
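
Why a virtual machine removes the need for a compiler in the terminal can be shown with a toy interpreter. The Python sketch below is not Java bytecode and not the MSDL-B format; the three opcodes and the function `run` are invented purely to illustrate that the terminal only has to execute simple, machine-independent instructions.

```python
def run(program, stack=None):
    """Execute a list of (opcode, argument) pairs on a simple stack machine."""
    stack = [] if stack is None else list(stack)
    for op, arg in program:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# (2 + 3) * 4, already translated into instructions by the sender.
program = [("push", 2), ("push", 3), ("add", None), ("push", 4), ("mul", None)]
result = run(program)
```

The translation from the readable high-level form into instructions happens on the sending side; the receiver needs only the small interpreter loop.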



11. Outlook

The framework of MSDL has already been worked out, but a great deal of detailed work remains to be
done. The description of the syntax is furthest advanced, followed by the overall architecture. For
the other parts of MSDL only initial approaches exist so far; for the multiplex, only the requirements
have been worked out.

The work has so far concentrated on flexibility level 1. However, flexibility level 2 must also be
taken into account in order to avoid conflicts later.

It is an ambitious goal to solve the outstanding problems within the given schedule, because the
MPEG-4 system must support or enable the required functionalities such as content-based interactivity,
high coding efficiency and universal access. The system must furthermore be flexible and offer
possibilities for extensions, including for future coding methods, because hardware and software will
continue to develop.

The integration of synthetic A/V objects must therefore also be investigated and specified further.
One will also want to code computer graphics, or objects and models created with CAD (Computer Aided
Design), together with their temporal behavior, for transmission. Components of virtual worlds that
are composed of synthetic, i.e. computer-generated, and natural, i.e. camera-recorded, building blocks
are also to be coded in order to save transmission bandwidth.









SIGNAL PROCESSING: IMAGE COMMUNICATION

Signal Processing: Image Communication 9 (1997) 385-431



The MPEG-4 systems and description languages:
A way ahead in audio visual information representation

Olivier Avaro^a,*, Philip A. Chou^b, Alexandros Eleftheriadis^c, Carsten Herpel^d,
Cliff Reader^e, Julien Signès^f

a FRANCE TELECOM-CNET, Avenue du Général Leclerc, 92131 Issy Les Moulineaux Cedex, France
b VXtreme, Inc., 701 Welch Road, Palo Alto, California, USA
c Department of Electrical Engineering, Columbia University, 500 West 120th Street, Mail Code 4712, New York, NY 10027, USA
d THOMSON Multimedia, Corporate Innovation Research & Deutsche Thomson-Brandt GmbH, Goettinger Chaussee 76,
30453 Hannover, Germany
e SAMSUNG Semiconductor, Inc., 3655 North First Street, San Jose, CA 95134-1713, USA
f FRANCE TELECOM-CCETT, 4, rue du Clos Courtel, BP 59, 35512 CESSON SEVIGNE CEDEX, France



Abstract 

The state of the standardization of the Systems part of ISO/IEC JTC1/SC29/WG11 (MPEG-4 Systems) as specified
early 1997 is presented. First a rationale and the architecture of the complete Systems are described. Based on the rapid
technological advances in software and in hardware, the MPEG-4 Systems provides for a framework for the integration
of natural and synthetic streamed and synchronized media. The different fields of interest in Systems are then
developed from the description of audiovisual scenes to the definition of the multiplex. The programmability of the
standard is described for composition and decompression and the language adopted by MPEG-4 on purpose of syntax
description is fully detailed. Finally, a conclusion and the future evolutions of the specifications are presented. © 1997
Elsevier Science B.V.

Keywords: MPEG-4 systems; Composition; Audiovisual scene description; Multiplex; Synchronization; Buffer
management; Interaction; APIs; Programmability



1. Introduction 

The fundamental goal of standardization has always been to provide universal interoperability. An
adjunct to that has also been to consider the cost of implementation, so a secondary goal has been to provide
universal affordability. In many cases, notably the case of audiovisual data coding, this has meant that a very
specific method has been chosen for standardization, because the lowest cost has been achieved with
a fixed-function solution. The penalty has been a limit to flexibility, both in the context of being
application-specific and in extendibility to future applications and operating environments.
This situation is currently changing, under the influence of rapid technological advances and declining
price points. Most notably, the clock speeds of CPUs and DSP chips are increasing very quickly, with a
consequent increase in computational power. As a result, it is becoming cost effective to solve traditional
problems in programmable systems. Indeed, the future is in software. Changes are also occurring in display
technology, though at a slower pace, and there is a trend toward higher resolution, progressive scan, high
refresh-rate displays. On the audio side, spatial audio is becoming important for entertainment applications
such as games.

Such developments provide a whole new degree of freedom for designing audiovisual communication and
storage systems. Being software based, there is less compelling need to standardize a specific algorithm, so
a set of algorithms covering a range of applications can be considered. Also the set can be extended in the
future. The changes in display technology mean there can be a decoupling between the coded representation
of the data and the presentation of the data. It is interesting to note that notwithstanding the advanced
technology deployed in the MPEG-1/2 standards, the input and output data are still the same analog TV
format invented almost 60 years ago! The important point to note is that the data structure of the coded data
has been forced in the past to be that of the presentation format. The MPEG-1/2 syntax is composed explicitly
of coded frames, but these data structure elements are not interesting in terms of comprising real-world objects.

Standards are composed of 'normative requirements', and in the case of the MPEG-1/2 standards, it should
be noted that most of the normative requirements are expressed in the syntax and semantics of the
standardized bitstream. The elements not expressed in the syntax concern the (fixed) data structure and the
(fixed) coding algorithm. It is of course tempting to propose that all normative requirements be expressed in
the syntax, and then to define the syntax (with semantics) as a programmable communication language, not
a rigid specification.

MPEG-4 takes advantage of these underlying developments. It provides a coded representation of
real-world audiovisual objects, as opposed to presentation-based images of those objects. It provides a truly
generic language for the communication of audiovisual objects. This then establishes a very flexible
environment that can be customized for specific applications and that can be adapted in the future to take
advantage of new developments in coding technology.

With all the normative requirements expressed in the bitstream, with an object-based data structure, and
with a software-based implementation, a user-driven, fully interactive environment is possible. Imagine that
the user can access an audiovisual 'scene' that is three-dimensional (both for audio and video), that has
a spatial extent and spatial resolution far higher than the presentation device used to access it, and that is
composed of audiovisual objects, animated in real time. Further imagine that some objects are being generated by
the encoder, but that other objects may have been generated in the past and have been downloaded or are
being generated locally. Interactivity can now take two forms: first, the ability to move the presentation
window around this scene and zoom in and out of it, and second, the ability to interact with the audiovisual
objects themselves. This is the environment that MSDL (MPEG-4 Systems and Description Languages) enables.

2. MPEG-4 Systems - the big picture

2.1. Scope

MPEG-4 Systems describes a coherent framework within which the requirements [1] of MPEG-4 can be
satisfied. It provides a foundation on which other MPEG-4 areas of work can be built. These areas include:
- the traditional MPEG areas of work: audio representation [2], video representation [3], multiplexing and
synchronization [5], and
- new areas of work: composition of audiovisual information [4, 5], representation of synthetic media [4],
and flexibility [5, 6], that are now required to achieve the MPEG-4 goals.
Fig. 1. MPEG-4 areas of work

Flexibility is a feature that can be applied orthogonally to any of the previous areas (flexible multiplexing,
flexible audiovisual representation, flexible composition). Flexibility enables description and configuration of 
the system. 

MPEG-4 Systems specifies a system for communicating audiovisual information, that is, the representation
of physical or virtual objects that can be manifested audibly and/or visually. At the encoder, audiovisual
information related to a physical scene is compressed, error protected if necessary, and multiplexed in one or
more coded binary streams that are transmitted. At the decoder, these streams are demultiplexed, error
corrected, decompressed, composited, and presented to the end user. The end user is given an opportunity to
interact with the presentation. Interaction information can be processed locally, or transmitted to the encoder.

The current section presents the functional description of MPEG-4 Systems. Previous MPEG
standards were mainly hardware driven. MPEG-4 requires extensibility and therefore needs a software
architecture. Flexibility is first presented, followed by the main elements of the software architecture and the
associated communication model. Then, the overall structure of an MPEG-4 terminal is described.

The areas of work related to MPEG-4 Systems (shaded elements in Fig. 1): composition (flexible and
non-flexible), flexible decoding (flexible decompression and syntactic decoding) and multiplexing are then
detailed in Sections 3-6.



2.2. Flexibility

MPEG-4 specifies a system that allows for communicating audiovisual information. In order that the
information transmitted or stored can be understood by the decoder, a template for this information has to
be known at the receiver side. In its previous standards, MPEG defined rigid a priori known templates for the
transmitted information. The MPEG-4 standard requires a more flexible representation of these templates,
so that they can be transmitted and allow for configuring the receiving system. In support of flexibility,
MPEG-4 defines two different types of terminals: non-flexible and flexible.

2.2.1. Non-flexible terminals

2.2.1.1. Definition. Non-flexible terminals have a small degree of decoder flexibility. This flexibility
corresponds to the flexibility achieved by previous MPEG terminals regarding audiovisual information
representation: it is achieved with the use of switches or selectors in the binary stream, which are basically n-ary
elements that select which of n pre-defined templates will be used for the coming information. This allows, for
example, the choice of a predefined standardized configuration. The switch flexibility is extended in
MPEG-4 to other areas of work such as composition and multiplexing.

2.2.1.2. Specification. Non-flexible terminals are based on a set of standardized elements called tools that
are combined in pre-determined standardized ways to provide a set of standardized algorithms and profiles
[1]. A given standardized combination fully specifies the information template for the coming information.
Choices between templates can be achieved with the use of switches.
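
The switch mechanism can be pictured with a minimal Python sketch. The table contents, the profile and tool names, and the function `select_template` are all hypothetical and only illustrate the idea that the bitstream carries a selector, never the template itself.

```python
# Hypothetical table of templates known a priori at the receiver.
# In a non-flexible terminal this table is fixed by the standard.
PREDEFINED_TEMPLATES = {
    0: ("profile_a", ["tool_1", "tool_2"]),
    1: ("profile_b", ["tool_1", "tool_3"]),
    2: ("profile_c", ["tool_2", "tool_3", "tool_4"]),
}


def select_template(switch_value):
    """Return the standardized template chosen by an n-ary switch."""
    try:
        return PREDEFINED_TEMPLATES[switch_value]
    except KeyError:
        # No explicit template is transmitted, so a value outside the
        # standardized set cannot be interpreted by the receiver.
        raise ValueError(f"switch value {switch_value} is not standardized") from None
```

A switch value outside the table raises an error, since the receiver has no other way to learn the template.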

2.2.1.3. Limitations. This type of flexibility was sufficient within the scope of previous MPEG standards and
has the attractive feature of being simple, practical and bit efficient. It poses significant limitations, however,
within the context of MPEG-4:

1. There is no explicit representation of the information template. Therefore, all the possibly needed
configurations have to be specified during the standardization process. As a result, terminals will not be
able to communicate if the information template is not known a priori, even if they have very compatible
capabilities.

2. The switch flexibility is hard to manage when the number of profiles and tools increases. At this stage, it is
not known whether or not the switch flexibility will be sufficient to efficiently cover the various
applications targeted by MPEG-4.

3. The evolution of the standard is hard to manage: switches have to be anticipated during the standardization
process, and future needs are hard to predict. Therefore, new applications may require a new round
of standardization, even if they use only already standardized tools. Moreover, introducing new tools
requires the explicit standardization of new algorithms and profiles.

2.2.2. Flexible terminals

2.2.2.1. Definition. MPEG-4 defines an enhanced flexibility mode in the representation of information
templates. A terminal supporting this mode is called a flexible terminal. To represent the information
templates, MPEG-4 decided to use classes (in the object-oriented sense of the word) and to encapsulate all
audiovisual information in audiovisual objects (AV objects) instantiated from audiovisual classes (AV
classes).

More generally, to have a consistent structure, the MPEG-4 specification adopts an object-oriented
approach for the description of all elements of its architecture. Hence, the object-oriented terminology is
adopted as much as possible in the rest of this document.

In flexible terminals, flexibility is achieved by the transmission of new classes, defining therefore
new templates for the transmitted information. The class definition defines both the data structure
and the methods that are used to process the data components. The downloaded classes may be related to the
various parts of the MPEG-4 flexible terminal: new composition tools, new decoding or demultiplexing
algorithms, etc.

The overall architecture for communicating AV objects and AV classes is as follows. Before the AV
objects are transmitted, the encoder and decoder exchange configuration information. The encoder determines
which classes of algorithms, tools, and other objects are needed by the decoder to process the AV
objects. Each class of objects is defined by a data structure plus executable code. The definitions of any
missing classes are downloaded to the decoder, where they supplement or override existing class definitions
installed or pre-defined at the decoder. As the decoder executes, new class definitions may be needed. In such
a case, the decoder can request that the encoder download specific additional class definitions. The
additional class definitions may be downloaded in parallel with the transmitted data. The above aspects are
illustrated in Fig. 2. Such a mechanism provides the decoder with the flexibility and extensibility desired in
MPEG-4.
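
The download-and-override mechanism described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: ordinary Python classes stand in for downloaded class definitions, and the names `Encoder`, `Decoder`, `download` and `instantiate` are hypothetical, not part of any MPEG-4 API.

```python
class Encoder:
    """Holds the class definitions that can be downloaded to a decoder."""

    def __init__(self, library):
        self.library = dict(library)

    def class_definition(self, name):
        return self.library[name]


class Decoder:
    def __init__(self, predefined):
        self.classes = dict(predefined)  # class name -> definition installed at the decoder
        self.requests = []               # names requested upstream from the encoder

    def download(self, name, definition):
        """Install a definition; it supplements or overrides what is there."""
        self.classes[name] = definition

    def instantiate(self, name, encoder, *args):
        """Create an object, requesting the class first if it is missing."""
        if name not in self.classes:
            self.requests.append(name)
            self.download(name, encoder.class_definition(name))
        return self.classes[name](*args)
```

A definition that is already installed is used directly; only a missing class triggers an upstream request.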
Fig. 2. Architecture of flexible terminals.
2.2.2.2. Specification. A practical specification of flexible terminals needs:

- a standardized class hierarchy, definition of classes, their associated methods, and their interfaces or APIs
(Application Programming Interfaces), and

- a transmission format for downloaded classes.

To transmit classes, MPEG-4 investigates both the use of machine independent bytecode (like the Java
bytecode) and the use of scripts (like VRML binary scripts). The two approaches are complementary.

The bytecode approach has the power of a general programming language. It will be used when few
assumptions can be made on the templates to describe (e.g., the description of a new decoding algorithm, the
description of a complex interpolation function). Within the context of the MPEG-4 standard, only the
executable equivalent of the language has to be specified: it is this data that eventually are transmitted to the
terminal. In the current Verification Model [6], the Java language is used for test and validation purposes,
and the bytecode approach would imply the standardization of the Java Virtual Machine. However, the
specification is still under development and other suitable object-oriented languages that fulfill the MPEG-4
requirements (including real-time constraints) may be considered.

Scripts are a less powerful but more concise approach to describe information templates. They may be
used when the description of a new class does not need the expressiveness of a programming language like
C++ or Java. In particular, a wide range of composition information templates can be described this way.
An extension of VRML 2.0 is the current reference to test and validate the script approach for describing
composition templates. Here also, only a binary equivalent of the script will be standardized and not the
textual format.

2.2.2.3. Limitations. From a computational point of view, the language used within flexible terminals has the
capability of a full Turing machine and, therefore, any algorithms can be described with this approach.
However, due to the machine independence requirement for this language, the description of complete new
algorithms could be inefficient, and proprietary code may significantly outperform a decoding system
described with this language. At this point, strategies to ensure high-quality performance for compliant
terminals are still under development.

A first element of these strategies is to limit bytecode flexibility to MPEG-4 areas of work where the
technology allows real-time performance at the date MPEG-4 becomes a standard. The priority has
therefore been given to Composition and AV Objects behavior, where satisfactory results can be obtained.
This area is also where the market has shown an increasing interest.
As a second element, it is envisioned that compliant terminals would only have to guarantee performance for
standardized tools, or standardized sets of tools. Based on the tools APIs and on the downloaded bytecode,
new algorithms could be described with satisfactory performance. Flexibility could then be extended to
other MPEG-4 areas of work such as audio and video algorithms configuration.

Finally, depending on the evolution of the technology and market needs, flexibility could be extended to 
downloading of tools in later phases, but is currently out of the scope of MPEG-4. 

A pictorial representation of the terminal's capabilities within the current scope of MPEG-4, with respect
to standardized and programmable components, is given in Fig. 3. Tools, algorithms and profiles refer to any
of the previously defined MPEG-4 areas of work.

2.3. Objects in the MPEG-4 Systems context

The MPEG-4 specifications adopt an object-oriented approach for the description of all the elements of its 
architecture. Basic object-oriented terminology can be found in [14]. This section gives an overview of the
more relevant MPEG-4 objects. 

2.3.1. MPEG-4 objects

MPEG-4 objects are objects in the object-oriented terminology, that is, entities that combine a data
structure (defining the object's state) with a set of methods (defining the object's behavior). A method is an
executable procedure associated with an object that operates on information in the object's data structure.
Classes are templates for objects. MPEG-4 standardizes a number of pre-defined classes. This set of classes is
called the MPEG-4 standard class library. In flexible terminals, based on this library, the user or encoder will
be able to produce new encoder-defined classes and instantiate objects according to these class definitions.

Classes in general, and MPEG-4 classes in particular, are organized hierarchically. The root of MPEG-4
classes is named MPEG4Object. The hierarchy specifies how a class relates to other classes, in terms of
inheritance, association, or aggregation. Graphical methods to represent this hierarchy are commonly used.
The OMT (Object Modeling Technique) notation has been chosen within the context of MPEG-4 Systems.

2.3.2. Audiovisual objects

AVObject is an MPEG-4 class derived from MPEG4Object. The AVObject class is used to represent
natural or synthetic objects that can be manifested audibly and/or visually. AV objects are generally
hierarchical in the sense that they may be defined as composites of other AV objects, which are called
sub-objects. AV objects that are composites of sub-objects are called compound AV objects. All other AV
objects are called primitive AV objects. The top-most object in the hierarchy is called the Scene. The
procedure of building a compound AV object from other AV objects is called composition. All AV objects
have a method to composite the visible or audible representation of an AVObject at a given point in time,
the render method.

AV objects, like other objects, have a behavior defined by their methods. In particular, AV objects have an
audiovisual behavior. This behavior, which defines how the object is manifested, can be programmed in
continuous time (e.g., a bouncing ball), can be updated with input data like user events or input streams (e.g.,
a moving picture, a game character), or both. Each AVObject has a handle method to specify the behavior
of the object in case of external events. Each AVObject may be connected to one or more data streams to
receive input data.
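
The relationships just described (root class, compound and primitive AV objects, the render and handle methods) can be sketched as follows. This is an illustrative Python model, not the MPEG-4 class library: the method signatures, the `is_compound` helper and the `Button` subclass are assumptions made for the example.

```python
class MPEG4Object:
    """Root of the class hierarchy described in the text."""


class AVObject(MPEG4Object):
    def __init__(self, name, sub_objects=()):
        self.name = name
        self.sub_objects = list(sub_objects)

    def is_compound(self):
        return bool(self.sub_objects)

    def render(self, t):
        """Return the primitive objects manifested at time t, depth first."""
        if not self.is_compound():
            return [(self.name, t)]
        rendered = []
        for sub in self.sub_objects:
            rendered.extend(sub.render(t))
        return rendered

    def handle(self, event):
        """Pass an external event down the hierarchy; return names that reacted."""
        reacted = []
        for sub in self.sub_objects:
            reacted.extend(sub.handle(event))
        return reacted


class Button(AVObject):
    """Hypothetical primitive that reacts to 'click' events."""

    def handle(self, event):
        return [self.name] if event == "click" else []
```

Calling `render` on the top-most object walks the whole hierarchy, which mirrors the way a compound object is built from its sub-objects.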

2.3.3. Process objects

AV objects may contain embedded or streaming data. In general these data are stored or transmitted in
a compressed format that cannot be directly used for the rendering procedure. Signal processing operations
thus have to be performed to reconstruct the audiovisual information that will subsequently be rendered.
Each distinct processing operation (e.g., linear transformation, prediction, filtering) is modeled as a process
object. These objects have an apply method taking parameters like AV objects or data streams. They define
processing operations used to modify other MPEG-4 objects.
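
A minimal sketch of this idea, assuming a single-argument `apply` method and two toy operations (`Dequantizer` and `DeltaPredictor` are hypothetical names chosen for illustration), shows how distinct processing steps can be chained to reconstruct data:

```python
class ProcessObject:
    """Base class: one distinct processing operation."""

    def apply(self, data):
        raise NotImplementedError


class Dequantizer(ProcessObject):
    """Scale quantized values back by a fixed step size."""

    def __init__(self, step):
        self.step = step

    def apply(self, data):
        return [value * self.step for value in data]


class DeltaPredictor(ProcessObject):
    """Undo first-order prediction: each sample is the previous one plus a residual."""

    def apply(self, data):
        samples, previous = [], 0
        for residual in data:
            previous += residual
            samples.append(previous)
        return samples


def decompress(data, chain):
    """Run the data through successive apply calls."""
    for process in chain:
        data = process.apply(data)
    return data
```

Replacing one element of the chain changes the coded representation without touching the objects that later render the result.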

MPEG-4 defines a language, the syntactic description language (MPEG-4 SDL), to describe the precise
binary syntax of an AV object's compressed information. This language is used to describe the syntactic
representation of objects in an integrated way with the overall class definitions.

2.3.4. Stream objects

An AV object reads the information it needs to update its state from InputStream objects. These objects
constitute the interface to MPEG-4 entities, named elementary streams. Data and control information
(description of classes, configuration information, etc.) for a Scene may be carried in several concurrent
elementary streams. It is the purpose of the MPEG-4 multiplexer to control the delivery schedule of these
elementary streams and, hence, to allow real-time decoding with limited delay, jitter and buffer requirements. For
transport through a network, or storage, these streams are packed into entities called logical channels and
may be multiplexed in a single data stream. Logical channels are virtual links between a sending and
a receiving multiplexer entity and are characterized by the quality of service parameters negotiated when
opening the channel. AV objects whose data are represented in individual elementary streams may be
subjected to bitstream editing. This allows, e.g., extracting all information related to such an AV object by
extracting its elementary streams without the need for any decoding.

2.4. Communication structure

The MPEG-4 specification adapts the traditional two-terminal model of a communication system, in
which one terminal (the encoder) encodes a sender's message and transmits it across a communication
channel and the other terminal (the decoder) decodes the message and presents it to a recipient.
This traditional model is augmented by allowing multiple channels and by allowing 'upstream' channels
(outgoing from the decoder) in addition to 'downstream' channels (incoming to the decoder). In particular,
point-to-point, multipoint-to-point, and point-to-multipoint communications may be supported in this way.
The MPEG-4 specification assumes that the primary flow of information is downstream. Whenever they are
available, upstream channels may be used for informing the encoder of the decoder's capabilities, negotiating
the transmission profile, requesting additional information from the encoder, reporting errors, feeding back
user controls, and so forth.
Streams (either downstream or upstream) are also classified either as control or data streams. Control
streams carry control information, such as connection setup, profile, and class definition information. Data
streams carry all other information, principally coded audiovisual objects, but possibly also information such
as Huffman tables, filter coefficients, or interaction information.

The communication session is partitioned into five phases that are run sequentially or in parallel: 

- Connection: In this phase, using services of the underlying transport layer, a connection is established
between the encoder and the decoder by exchanging Encoder Start of Session (ESS) and Decoder Start of
Session (DSS) information. Either encoder or decoder may initiate the connection.

- Configuration: In this phase, the encoder and decoder agree on a profile (set of tools and algorithms) by
exchanging Encoder Configuration Information (ECI) and Decoder Configuration Information (DCI) messages
describing their capabilities (implemented classes, computational capabilities, etc.).

- Learning: In a communication involving flexible terminals, one or more class definitions follow one after
another in sequence. A Class Definition (CD) specifies both a data structure and executable methods. It is
possible that somewhere in the chain of actions resulting from a call of a method at the decoder side,
a reference to an undefined class is made. This generates an error at the decoder, unless upstream control is
available; in that case, the decoder can issue a Class Definition Request (CDR) to the encoder, load the
class, and proceed.

- Transmission: In this phase, coded audiovisual objects (and possibly other information) are placed in the
data streams by the encoder and removed from the data streams by the decoder. The information format is
determined entirely by the current classes in use.

- Disconnection: In this phase, the connection between the encoder and the decoder is broken by exchanging
Encoder End of Session (EES) and Decoder End of Session (DES) information.

These communication phases are illustrated in Fig. 4, along with the collection of streams established in 
general between the encoder and the decoder, and the typical flow of information in those streams. 
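
As a rough illustration of these phases and of the control/data stream split, the Python sketch below lists the messages of one simple session in strict sequence. It is a simplification: the text allows phases to overlap, and the `Phase` enum, the `session_trace` function and the message strings are hypothetical names used only for this example.

```python
from enum import Enum


class Phase(Enum):
    CONNECTION = "connection"
    CONFIGURATION = "configuration"
    LEARNING = "learning"
    TRANSMISSION = "transmission"
    DISCONNECTION = "disconnection"


def session_trace(class_definitions, coded_objects):
    """List (phase, stream kind, sender, message) tuples for one simple session."""
    trace = [
        (Phase.CONNECTION, "control", "encoder", "start of session"),
        (Phase.CONNECTION, "control", "decoder", "start of session"),
        (Phase.CONFIGURATION, "control", "encoder", "configuration information"),
        (Phase.CONFIGURATION, "control", "decoder", "configuration information"),
    ]
    # Class definitions travel on control streams.
    for name in class_definitions:
        trace.append((Phase.LEARNING, "control", "encoder", f"class definition: {name}"))
    # Coded audiovisual objects travel on data streams.
    for obj in coded_objects:
        trace.append((Phase.TRANSMISSION, "data", "encoder", f"coded object: {obj}"))
    trace.append((Phase.DISCONNECTION, "control", "encoder", "end of session"))
    trace.append((Phase.DISCONNECTION, "control", "decoder", "end of session"))
    return trace
```

Only the transmission phase uses data streams in this sketch; everything else is control information.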



2.5. Processing stages in the MPEG-4 terminal

An MPEG-4 terminal is a system that allows presentation of an interactive audiovisual scene from
audiovisual coded information. It can be either a standalone application, or part of a multimedia terminal
that needs to deal with MPEG-4 audiovisual information representation, among others.

The basic operations performed by such a terminal are the following: a description of the audiovisual scene
is sent downstream to the decoder. The scene is instantiated. Data representing AV objects are demultiplexed
and then processed by the relevant decoders. The decoded AV objects are then composited and rendered in
order to be presented on the terminal presentation device. These basic operations are depicted in Fig. 5.

In the current implementation, the operation performed by such a viewer is the following: one class at the
viewer side defines a 'Main' AV class derived from the class AVObject. This class is instantiated as a 'main'
AV object. This 'main' object is the AV scene, i.e., the top-most AVObject in the hierarchy. The main object's
render method (inherited from AVObject) is then repeatedly executed. This operation produces primitive
AV objects, such as Image and Waveform objects, which will then be presented to the user through output
devices.

The main object's render method is called once for each audiovisual frame that the viewer wishes to
present. This method invokes other methods and other objects, for example:

- calls to render methods of other AV objects related to the 'Main' class,

- calls to apply methods of process objects to recover image and audio waveform objects from compressed
data,

- calls to syntactic decoding methods to extract compressed data from AV objects data streams, and

- calls to demultiplexing methods to extract elementary data streams from logical input channels.
In this way, repeated calls to the main render method cause the input information to be decoded and
rendered. Basic functions related to the terminal behavior are now described.
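
The pull-driven chain of calls can be sketched as below. Every function here is a hypothetical stand-in (the "channel" is a plain dictionary and the "decompression" merely doubles values); the point is only the order in which one render call reaches down through the layers.

```python
calls = []  # records the order in which the layers are invoked


def demultiplex(channel, stream_id):
    """Extract one elementary stream from the logical input channel."""
    calls.append("demultiplex")
    return channel[stream_id]


def syntactic_decode(stream):
    """Recover basic values from the elementary stream."""
    calls.append("syntactic_decode")
    return [int(token) for token in stream.split()]


def apply_process(values):
    """Stand-in for a process object's apply method."""
    calls.append("apply")
    return [value * 2 for value in values]


def render_sub_object(channel, stream_id):
    """Render one sub-object by pulling its data through the layers."""
    calls.append("render")
    return apply_process(syntactic_decode(demultiplex(channel, stream_id)))


def render_main(channel):
    """Called once per presentation frame; returns decoded data per sub-object."""
    return {stream_id: render_sub_object(channel, stream_id) for stream_id in sorted(channel)}
```

Each call to `render_main` triggers, per sub-object, a render, a demultiplexing step, a syntactic decoding step and an apply step, in that order.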

2.5.1. Composition and rendering

Composition organizes the AV objects in the scene. Each AV object has a local coordinate system, and the
local coordinate system of the main AV object (the scene) is also the world coordinate system. To composite
is to place all AV objects in the scene, i.e., to define the mapping from the AV objects' local coordinate
systems to the world coordinate system. Rendering projects the scene onto one or more audiovisual frames,
for subsequent presentation to the user. Composition is currently implemented by calling the render
method of AV sub-objects, each in its own composition context.

In non-flexible composition, the composition information is transmitted (retrieved) at the receiver with
a given syntax and semantics (e.g., transformation parameters and timing information coded with a given
number of bits). This syntax is specified in Section 3 for the 2D case.

More complex scene descriptions are needed for some applications. In support of these applications,
MPEG-4 defines a flexible composition mode consisting of a rich scene description format for composition of
streamed and synchronized audiovisual information. In this flexible mode, configuration information (classes
or scripts) can be sent to the receiver to complete composition functionalities and describe complex or
custom composition.

2.5.2. Decompression 

Decompression recovers the AV objects' data from their encoded formats, and provides the composition
layer with this information. Decompression is currently performed by successive calls to the apply methods
of ProcessObjects to recover the needed image and audio waveforms from the input streams.
Specific standardized ProcessObjects will correspond to the non-flexible decompression algorithms
specified by the Video, Audio and SNHC sub-groups. These objects can then be used to construct, for
example, new compound interactive scenes.

For a given AV scene, the best suited compression format for a given AV object in this scene may change
during the scope of the application, or from one application to another. Therefore, different coded
representations for a given AV object could be used. Such flexibility can be achieved by defining and
downloading ProcessObject classes that will replace the ones previously called by the decompression
procedure. MPEG-4 Systems thus allows a separate description of the decompression and composition
procedures.



2.5.3. Syntactic decoding

Syntactic decoding is the process of recovering basic semantic entities (such as quantized values and time
references) from elementary data streams and mapping them to the internal machine representation.

In order to allow separate and efficient configuration of syntactic decoding, the MPEG-4 specification
made the choice to decouple as much as possible the definition of the bitstream syntax (including entropy
coding) from the subsequent operations of decompression (including de-quantization) and rendering.

To achieve this, MPEG-4 defines a language, the syntactic description language (MPEG-4 SDL), to
describe the precise binary syntax of an AV object's compressed information. This language is used to
describe the syntactic representation of objects in an integrated way with the overall class definitions.



O. Avaro et al. / Signal Processing: Image Communication 9 (1997) 385-431

2.5.4. Multiplexing

Multiplexing provides the interfaces between the network and storage media and the other system layers. Its
basic function is to recover elementary data streams from logical downstream channels and to multiplex
upstream data in logical upstream channels. These elementary streams may carry either object data or control
information related to objects or to system management. The multiplex layer provides system control tools, e.g.,
to receive new classes, to manage synchronization, to recover the system time base, and to manage system buffers.

2.6. Evolution of MPEG-4 Systems

MPEG-4 Systems is the result of needs and convergence in the technology of audiovisual representation.
The three key technologies, audiovisual representation (compressed, streamed and synchronized,
natural and synthetic, 2D and 3D audiovisual objects), audiovisual composition (2D and 3D worlds
composed of objects having behavior and responding to interaction) and programmability (downloading of
software modules in a platform independent bytecode), have to converge in a consistent way in order to
provide a satisfactory and integrated solution to market needs.

This integrated solution does not yet exist. Indeed, the three previous key technologies are just becoming
sufficiently mature to acknowledge their inadequacy to solve the entire problem alone. To provide the place
where the best technical solutions in each of these fields can be integrated in a consistent way, the current
MPEG-4 Systems specification still needs to evolve: the description of composition information using scripts
needs to be integrated with the current APIs, which themselves need to be completed; a possible harmonization of
2D and 3D representations could be done; and event models have to be developed.

To achieve this, the Verification Model methodology, applied with success within the context of Audio and
Video algorithms specifications, is now set up for Systems activities [6]. This methodology provides
a framework for collaborative development and software sharing within the MPEG community. Evolutions
are driven by the results of experiments made on the Verification Model and ensure a constant validation and
evaluation of the specifications.

3. Composition - preparation for presentation
3.1. Scope

In the graphic arts, 'compositing' refers to the process of creating a single image from multiple overlapping
images, by blending them together in an appropriate way.

The Compositor, in MPEG-4, performs compositing in this traditional sense (using digital images of course),
but it also does much more. In particular, it coordinates the process of composition, in which two- and
three-dimensional time-varying visual objects are rotated, scaled and positioned with respect to one another in
space and in time, and it handles the process of rendering, in which the composited visual objects (at a given
instant in time) are 'imaged' or 'rasterized' to produce image fragments. The resulting image fragments are
finally blended to produce the desired image. In addition to the above, the Compositor in MPEG-4 performs
the analogous composition, rendering, and blending (i.e., mixing) operations for audio objects.

The effects that can be produced with the MPEG-4 Compositor include the following:

- allowing user navigation through a time-varying 3D environment;

- placing a scaled-down video in the corner of another video (i.e., picture-in-picture);

- overlaying text such as subtitles or scrolling credits, on top of a movie;

- defining moving 'hot' objects in a scene that respond to user interaction;

- providing audible or visual feedback (e.g., color changes) as the result of user selection;

- integrating video, still images, graphics, animated graphics (e.g., sprites) and audio into a single audiovisual
presentation;

- positioning buttons, pull-down menus, and other user-interface elements on top of a presentation;

- modulating point sources of audio to appear spatially localized (e.g., becoming louder when approached);

- stereo vision, multichannel audio;

and so forth.

Coordinate systems play a central role in producing these effects. Coordinate systems are the 'handles' by
which audiovisual objects can be rotated, scaled, translated, delayed, and otherwise manipulated. For this
purpose, every audiovisual object in MPEG-4 has an associated 'local' coordinate system. The pose of an
audiovisual object within a scene is determined by a coordinate transformation from the object's local
coordinate system into the scene's 'global' coordinate system.

In MPEG-4, coordinate systems generally have four dimensions: three spatial dimensions (x, y, z) and one
temporal dimension (t). This is sometimes denoted '3D+T'. By convention, the spatial portion of the
coordinate system is right-handed,¹ and the temporal portion of the coordinate system has its origin at the
beginning of the audiovisual object with t increasing towards the end of the object.

Transformations between coordinate systems are generally affine (i.e., linear with a translation component).
Thus affine coordinate transformations mapping (x, y, z, t) to (x', y', z', t') can be described in matrix
notation, as follows:
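
The matrix itself is not legible in this copy. A form consistent with the surrounding description, using the assumed symbol names $a_{ij}$ for the linear part and $b_i$ for the translation part, would be:

```latex
\[
\begin{pmatrix} x' \\ y' \\ z' \\ t' \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & 0 \\
a_{21} & a_{22} & a_{23} & 0 \\
a_{31} & a_{32} & a_{33} & 0 \\
0      & 0      & 0      & a_{44}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
\]
```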
Note that the extra zeros in the matrix force the spatial and temporal components to be separable, in the
sense that spatial components cannot 'leak' into the temporal component, and vice versa. Similarly, other
components may be constrained. For example, when full 3D manipulation is impractical, coordinate systems
may be restricted to '2.5D+T', which means that coordinate transformations take the form

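\[
\begin{pmatrix} x' \\ y' \\ z' \\ t' \end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & 0 & 0 \\
a_{21} & a_{22} & 0 & 0 \\
0      & 0      & 1 & 0 \\
0      & 0      & 0 & a_{44}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix}
+
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix}
\]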

In this case, transformations in the x-y plane cannot leak into the z dimension, and vice versa. The
z dimension is provided only for non-scaleable translations in depth.
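
As a rough sketch of how such a separable transformation might be applied, the following Python functions build a 4x4 matrix with the space/time coupling terms set to zero and map a local point into scene coordinates. The function names and argument layout are assumptions made for this example.

```python
def make_transform(spatial, time_scale, offset):
    """Build matrix A and offset b with zeros coupling space and time.

    spatial: 3x3 nested list; time_scale: number; offset: (bx, by, bz, bt).
    """
    a = [list(row) + [0.0] for row in spatial]   # spatial rows ignore t
    a.append([0.0, 0.0, 0.0, time_scale])        # temporal row ignores x, y, z
    return a, list(offset)


def transform(a, b, point):
    """Map local (x, y, z, t) to scene coordinates: p' = A p + b."""
    return tuple(
        sum(a[i][j] * point[j] for j in range(4)) + b[i]
        for i in range(4)
    )
```

For instance, halving x and y while shifting the object and delaying it by two time units gives a picture-in-picture style placement, and changing t alone leaves the spatial result untouched.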

In general a scene has temporal extent, and is composed of audiovisual objects in various spatio-temporal
poses determined by coordinate transformations. The parameters of these transformations (and hence the
poses) may change continuously and smoothly over time. Furthermore, the audiovisual objects in a scene
may disappear and reappear, and new ones may be introduced, at any positive instant in time. Thus a scene
and its objects have continuous-time behavior, regardless of any discrete units used to sample, code, or
present the objects. This continuous-time model of object behavior makes it possible to decouple the
presentation frame rate from any underlying sampling and coding rates, so that audiovisual objects with
differing underlying sampling or coding rates can be presented together. It also makes it possible for
presentation frame rates to vary from decoder to decoder, or even vary in time within a decoder, so that the
presentation can scale to the computational capacity of the decoder.
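
One simple way to picture this decoupling is to treat a pose parameter as a function of continuous time that is evaluated at whatever instants the presenter chooses. The sketch below uses linear interpolation between coded samples, which is an assumption of this example; `pose_at` and `present` are hypothetical names.

```python
from bisect import bisect_right


def pose_at(samples, t):
    """Linearly interpolate a pose parameter from (time, value) samples sorted by time."""
    times = [sample[0] for sample in samples]
    i = bisect_right(times, t)
    if i == 0:
        return samples[0][1]       # before the first sample
    if i == len(samples):
        return samples[-1][1]      # at or after the last sample
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def present(samples, frame_rate, duration):
    """Evaluate the pose at each presentation instant for a chosen frame rate."""
    frame_count = int(duration * frame_rate)
    return [pose_at(samples, k / frame_rate) for k in range(frame_count + 1)]
```

The same coded samples can be presented at 2 frames or 4 frames per time unit without re-coding, which is the property the paragraph describes.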

The pose and rendering properties of each audiovisual object in a scene as a function of time may be
described by a program or may be interpolated from sampled data, or both. Such information is called
composition information. In non-flexible terminals, this information takes the form of coded parameters. In
flexible terminals, this information takes the form of executable instructions, which trigger operations
through the API. Non-flexible composition is described next, followed by flexible composition.

¹ In a right-handed coordinate system, if the x-y coordinate plane is viewed such that x increases from left to right and y increases from
bottom to top, then z increases towards the viewer.



3.2. The non-flexible approach to composition

The non-flexible approach to composition can vary significantly, depending on the Systems profile. This
section describes a simplified version of the non-flexible composition syntax currently specified in the
Systems Working Draft [5]. It should be taken as an example of the kind of Systems profile for composition
that can be achieved with non-flexible terminals.

In this non-flexible syntax, the pose and rendering properties of each audiovisual object in the scene can be
established and can be updated occasionally or even continually.

Whenever composition information (i.e. pose and rendering properties) for an audiovisual object must be
established for the first time, or changed, the composition information is coded and transmitted in an
elementary stream dedicated for this purpose. The composition information is tagged with a code for the
audiovisual object to which it should apply, and the information is timestamped to indicate the time at which
the information is to take effect, i.e. the time at which it is to supersede old information for that audiovisual
object, if any. Composition information may be sent once, and never updated, or it may be updated nearly
continuously (at the presentation frame rate, for example).

The composition information for an audiovisual object, as illustrated in Fig. 6, consists of first one bit to
indicate whether or not the object is visible (or audible). If the object is invisible (or inaudible), then no further
information is included. This bit can be used to hide transmitted AV objects temporarily. Otherwise, five bits
are used to code a composition order parameter, which indicates the order in which the audiovisual object
will be rendered relative to the other objects in the scene. Each object should have a unique composition
order number. The remaining bits are used to code the parameters of the coordinate transformation that
maps the object's local coordinate system into the scene's coordinate system. This is accomplished using one
bit to indicate 2D or 3D, a few more bits to indicate how many parameter sets are coded, and finally ten bits
or more for each of the coded parameters. The parameters are ordered by their importance (e.g. the x and
y translation parameters are the most important), so that the number of parameter sets coded,
along with the 2D/3D flag, determines which parameters are coded.
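
The record layout just described can be sketched as a small bit reader. This is only an illustration under
stated assumptions, not the syntax of the Working Draft: the class and field names (CompositionInfoReader,
Info, COUNT_BITS, PARAM_BITS) are hypothetical, bits are assumed to be read most significant bit first,
the "few more bits" are taken to be three, each parameter is taken to be exactly ten unsigned bits, and the
count is treated as a number of parameters rather than of parameter sets.

```java
/** Hypothetical reader for the simplified composition record (assumed layout). */
final class CompositionInfoReader {
    // Assumed widths; the text fixes only the 1-bit flags and the 5-bit order.
    static final int COUNT_BITS = 3;   // "a few more bits" (assumption)
    static final int PARAM_BITS = 10;  // "ten bits or more" (assumption: exactly ten)

    /** Decoded record; parameters are kept as raw unsigned code words. */
    static final class Info {
        boolean visible;
        int compositionOrder;
        boolean is3D;
        int[] params = new int[0];
    }

    private final byte[] data;
    private int bitPos = 0;

    CompositionInfoReader(byte[] data) { this.data = data; }

    /** Reads n bits, most significant bit first (assumed bit order). */
    int readBits(int n) {
        int value = 0;
        for (int i = 0; i < n; i++) {
            int bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }

    Info read() {
        Info info = new Info();
        info.visible = readBits(1) == 1;
        if (!info.visible) return info;            // nothing further is coded
        info.compositionOrder = readBits(5);       // rendering order in the scene
        info.is3D = readBits(1) == 1;              // 2D/3D flag
        int count = readBits(COUNT_BITS);          // how many parameters follow
        info.params = new int[count];
        for (int i = 0; i < count; i++) info.params[i] = readBits(PARAM_BITS);
        return info;
    }
}
```

Because the parameters are ordered by importance, a reader of this kind never needs a per-parameter tag:
the count and the 2D/3D flag together identify which parameters are present.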

Thus initial poses can be established for all objects that are initially visible in a scene, by transmitting for
each object an object identifier, a timestamp, a visibility bit, a composition order parameter, and some
number of coordinate transformation parameters. The visibility, composition order, and pose of any
audiovisual object can be changed at any time.

Between updates, the composition information is held constant in this non-flexible syntax. This is
a piecewise-constant, continuous-time model, in which the composition information is defined for all positive
time. Likewise, audiovisual objects themselves are defined for all positive time (although their values may be
temporally interpolated from an underlying discrete-time representation). Thus it is possible to render (onto
an output frame) a scene composed of audiovisual objects, at any positive instant in time. Indeed, MPEG-4
does not specify a presentation frame rate, and it is not necessary to do so.
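
A piecewise-constant, continuous-time model of this kind might be held in a decoder as a sorted map from
timestamps to updates, queried at whatever instant a frame happens to be rendered. The sketch below is an
assumption about one possible implementation; CompositionTimeline and its methods are hypothetical
names, and timestamps are taken to be seconds.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical store for the timestamped composition updates of one object. */
final class CompositionTimeline<T> {
    private final TreeMap<Double, T> updates = new TreeMap<>();

    /** Registers information that takes effect at the given timestamp. */
    void update(double timestamp, T info) {
        updates.put(timestamp, info);
    }

    /** Returns the information in force at time t, or null before the first update. */
    T at(double t) {
        // The latest update whose timestamp is not later than t supersedes all older ones.
        Map.Entry<Double, T> entry = updates.floorEntry(t);
        return entry == null ? null : entry.getValue();
    }
}
```

Since the lookup accepts any time value, the presentation frame rate never has to be known to the encoder.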

For brevity, the discussion now is restricted to visual objects. The appearance of a visual object at the time
it is rendered onto a video frame for presentation is determined by a number of factors, including both the
object's composition order and its depth, or distance from the projection plane. The depth information is
used to determine the appearance of an object in the presence of other opaque occluding objects. However,
for objects at exactly the same depth, and for semi-transparent objects at any depth, the order in which the
objects are blended onto the video frame is also needed to determine the objects' appearance. This order
information is given by the objects' composition order parameters.

Rendering of a visual object is in general accomplished by projecting the object onto the projection plane,
resampling the object to match the pixelization of the target video frame, computing the color and
transparency of each object pixel (which may vary according to the object's orientation, surface normals, and
lighting conditions), and finally blending each object pixel with its corresponding target pixel according to
the object's blending factors, if the object pixel is not hidden by a previously rendered, opaque object. Of
course, much of this procedure is simplified in the 2.5D+T case.

Color, transparency, and blending factors are properties of visual objects. Color is specified by a triple,
(c1, c2, c3), which along with a colorspace (YUV, RGB, XYZ, etc.) determines the object's color in a standard
component color model. Transparency is specified by a single component, alpha. The color components, and
alpha, take values in the unit interval [0, 1], although they are usually uniformly quantized to 8 bits of
precision. Blending factors are the 4-dimensional vectors ν and μ by which the components (c1, c2, c3, alpha)
are multiplied component-wise in both the source image N (e.g., a rasterized object) and the destination
image M (e.g., the current output video frame) before adding to form a new destination image P (e.g., the new
output video frame), as follows:

$$
\begin{aligned}
P.c1 &= \mathrm{clip}(N.c1 \cdot \nu.c1 + M.c1 \cdot \mu.c1), \\
P.c2 &= \mathrm{clip}(N.c2 \cdot \nu.c2 + M.c2 \cdot \mu.c2), \\
P.c3 &= \mathrm{clip}(N.c3 \cdot \nu.c3 + M.c3 \cdot \mu.c3), \\
P.alpha &= \mathrm{clip}(N.alpha \cdot \nu.alpha + M.alpha \cdot \mu.alpha).
\end{aligned}
$$

Here, 'clip' means that the computed elements of the destination image, e.g., P.c1, are restricted to the unit
interval [0, 1].

The blending factors ν and μ can be independently specified to be one of the following combinations:

ZERO = (0, 0, 0, 0),
ONE = (1, 1, 1, 1),
DST_COLOR = (M.c1, M.c2, M.c3, M.alpha),
SRC_COLOR = (N.c1, N.c2, N.c3, N.alpha),
ONE_MINUS_DST_COLOR = (1, 1, 1, 1) - (M.c1, M.c2, M.c3, M.alpha),
ONE_MINUS_SRC_COLOR = (1, 1, 1, 1) - (N.c1, N.c2, N.c3, N.alpha),
SRC_ALPHA = (N.alpha, N.alpha, N.alpha, N.alpha),
ONE_MINUS_SRC_ALPHA = (1, 1, 1, 1) - (N.alpha, N.alpha, N.alpha, N.alpha),
DST_ALPHA = (M.alpha, M.alpha, M.alpha, M.alpha),
ONE_MINUS_DST_ALPHA = (1, 1, 1, 1) - (M.alpha, M.alpha, M.alpha, M.alpha),
SRC_ALPHA_SATURATE = (f, f, f, 1), where f = min(N.alpha, 1 - N.alpha).

Common selections are ν = ONE and μ = ZERO, in which case the source image is always painted on top
of the destination image, or ν = SRC_ALPHA and μ = ONE_MINUS_SRC_ALPHA, in which the
source image is weighted by the object's opacity, and the destination image is weighted by the object's
transparency.
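
The blending rule and the two common selections can be written out directly. The following is a minimal
sketch, assuming pixels are stored as four doubles (c1, c2, c3, alpha) in the unit interval; the class Blend
and its helper names are hypothetical and are not part of the composition API.

```java
/** Sketch of the component-wise blend: clip(source * sourceFactor + destination * destFactor). */
final class Blend {
    static final double[] ZERO = {0, 0, 0, 0};
    static final double[] ONE = {1, 1, 1, 1};

    /** Restricts a value to the unit interval [0, 1]. */
    static double clip(double v) {
        return Math.max(0.0, Math.min(1.0, v));
    }

    /** Blending factor built from the source alpha. */
    static double[] srcAlpha(double[] src) {
        double a = src[3];
        return new double[]{a, a, a, a};
    }

    /** Blending factor built from one minus the source alpha. */
    static double[] oneMinusSrcAlpha(double[] src) {
        double a = 1.0 - src[3];
        return new double[]{a, a, a, a};
    }

    /** Blends one source pixel onto one destination pixel, component by component. */
    static double[] blend(double[] src, double[] srcFactor, double[] dst, double[] dstFactor) {
        double[] out = new double[4];
        for (int i = 0; i < 4; i++) {
            out[i] = clip(src[i] * srcFactor[i] + dst[i] * dstFactor[i]);
        }
        return out;
    }
}
```

With the factors ONE and ZERO the result is simply the source pixel; with the source-alpha pair, a half
opaque source contributes half of each of its components and the destination contributes the other half.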

As previously stated, color, transparency, and blending factors are properties of visual objects. Hence
they can all be changed in this non-flexible syntax to the extent allowed by individual objects. For
example, in a video object, color, transparency, and perhaps blending factors can be changed every encoded

Many interesting effects can be achieved with non-flexible composition. However, the approach is limited.
In the above non-flexible composition example, the encoder must explicitly transmit every change in a scene's
composition, using valuable bandwidth. Groups of objects do not share related composition information.
Thus a scene containing thousands of polygons representing animated objects would use an extraordinary
amount of bandwidth, even if the animation can be naturally described with only a few time-varying
parameters. Furthermore, none of the examples above involving user interaction can be achieved with the
non-flexible composition syntax alone, as it is defined above. The syntax is too rigid to express the range of
behaviors that one may desire to provide at the decoder in response to user interaction and other local state
changes. One solution to the problem is to enlarge the non-flexible bitstream syntax to accommodate
a specific set of behaviors, such as the ones above. However, the range of possible behaviors is extremely
large. The ultimate solution is to provide a mechanism for the encoder to download code in a programming
language to describe the desired behaviors. This is the approach taken in the flexible approach to
composition.

3.3. The flexible approach to composition

In the flexible approach to composition, it is assumed that a script in a flexible programming language
such as Java or a variation of it can be downloaded into the MPEG-4 decoder. The language will not have
powerful composition operators built in (as PostScript does, for example). Rather, downloaded scripts in the
language will perform powerful composition operations by passing data through a standardized interface,
the composition API. This is similar to the flexible approach to decompression (Section 4).

The composition API consists of modules, or classes of objects. Each module defines a data structure and
a collection of methods (i.e. functions) for its class of objects, as is typical in object-oriented programming
methodology. Discussed here are only the major modules, and the generic functions they perform. The
composition API, like other MPEG-4 Systems APIs, is under continual development. The latest version of
the API can be found at the MSDL web site http://www-elec.enst.fr/msdl/.

The central module in the composition API is the Compositor. The Compositor composites, renders, and
blends audiovisual objects onto output audio and video frames. Other modules in the composition API
basically support these functions.

The Compositor maintains one audio frame or one video frame for each output channel. For example,
a Compositor that outputs five-channel audio and two-channel (stereo) video maintains five audio frames
and two video frames. An audio frame is a finite sequence of audio samples; a video frame is a rectangular
array of pixels.

Audiovisual effects are produced frame by frame. At some presentation frame rate (which may vary
from one decoder implementation to another, or may even be time-varying), a sequence of instructions
in the scripting language is sent to the Compositor. These instructions (actually, method calls) produce
all audio and video frames (one per output channel) in parallel. Thus the Compositor produces video and
audio frames in lock-step, at the presentation frame rate. By definition, the length of the audio frames is
adjusted to the interval between video frames. Once the audio and video frames are produced, a mechanism
outside the scope of the API concatenates the audio and video frames onto output buffers for presentation to
the user.

Instructions in the programming language to the Compositor are fairly low level and fall into three
categories:

- instructions to render pre-defined (i.e. standardized) and encoder-defined (i.e. downloaded) audiovisual
objects,

- instructions to change the current state, or 'context', and

- instructions to save or restore the current state.

Sequences of such instructions to the Compositor are used to produce the video and audio frames. A typical
sequence of instructions to composite, render, and blend an audiovisual object into a scene might be: save the
current state, modify the state to affect the next audiovisual object to be rendered, render the object, and
restore the saved state. Such sequences are repeated for each object in the scene.

The current state, or context, maintained by the Compositor includes the following elements:

- a coordinate transformation matrix that maps the local spatio-temporal coordinate system of an
audiovisual object into the global spatio-temporal coordinate system of the scene,

- a properties sheet that contains the default color, transparency, blending factors, line width, and so forth,
for audiovisual objects as they are rendered,

- an array of 'cameras', which specify the projection and clipping planes (with respect to the global
coordinate system) for each video channel,

- an array of 'microphones', which specify the location and directional sensitivity of an acoustic sink point,
for each audio channel,

- an input stream, from which audiovisual objects may read data as they are rendered,

- an output stream, to which audiovisual objects may write data as they are rendered,

- and a dictionary of attribute/value pairs, which may be used to pass generic information from one
audiovisual object to the next.

The coordinate transformation matrix defines the coordinate transformation discussed in Section 3.1, and
can be represented by that transformation's parameters (the A and B coefficients). The properties sheet
defines the default parameters for any rendering operations that require parameters. For example, the
properties sheet contains parameters c1, c2, c3, alpha, which represent the color and transparency of objects
that do not explicitly specify color and transparency, e.g., 3D objects without associated texture maps. The
cameras define the precise relationships between the coordinate systems of the logical displays and the
coordinate system of the scene. These relationships are represented by coordinate transformations. The cameras
also define parameters relating to aspect ratio, clipping volume, etc. The microphones define analogous
relationships between the logical audio output devices and audio sources in the scene. They also define
parameters relating to directional sensitivity, gain, attenuation, and so forth. The input stream is included in
the Compositor state because the contents of the input stream determine how most audiovisual objects
(audio, video, synthetic animation, and other coded objects) appear when rendered. The output stream is
included because audiovisual objects are able to produce output information when they are processed.
Finally, the dictionary of attribute/value pairs is a catch-all mechanism for passing generic information from
one audiovisual object to another. The value of an arbitrary attribute can be set by one audiovisual object when
it is rendered, and can be read by another audiovisual object when it is rendered. Like 'environment variables'
in UNIX, this mechanism is especially effective for broadcasting information across a collection of objects.



Note that the audio and video frames produced by the Compositor are presentation frames, and do not in general
correspond to any periodic structures, such as 'frames', that may be used to code audiovisual objects. In fact,
since the presentation frame rate may vary from one decoder to another, or may be time-varying depending on the
computational load at the decoder, it is impossible for an encoder to keep track of the presentation frame rate.
This is an example of the architectural separation between the presentation of audiovisual objects and their
underlying representation.





These state elements are simply objects from different classes in the composition API. The Compositor
maintains one object from each of these classes. Thus, the Compositor maintains one object from a coordinate
transformation class, one object from a properties sheet class, and so forth. The elements of the current
Compositor state are therefore manipulable individually, by means of method calls to the individual objects
in the Compositor. For example, instructions may be sent to the Compositor's current coordinate
transformation matrix to modify its spatial or temporal components. This is the way to set the spatio-temporal
pose of the next object to be rendered.

The Compositor also maintains stacks of objects to facilitate saving and restoring some elements of the
current state. In particular, the Compositor maintains a stack for coordinate transformation matrices and
a stack for properties sheets. Thus these state elements can be saved and restored by pushing and popping
their associated stacks.
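
The save/restore mechanism can be pictured with a toy compositor that keeps a stack of transformations.
This is a sketch only: MiniCompositor and its Transform class are hypothetical stand-ins, the
transformation is reduced to a 2D translation, and only the method names pushTransform and popTransform
are taken from the example scripts in this paper.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical miniature of the Compositor's save/restore mechanism. */
final class MiniCompositor {
    /** Stand-in for the coordinate transformation matrix: a 2D translation only. */
    static final class Transform {
        double tx, ty;

        Transform copy() {
            Transform t = new Transform();
            t.tx = tx;
            t.ty = ty;
            return t;
        }

        void translate(double dx, double dy) {
            tx += dx;
            ty += dy;
        }
    }

    Transform transform = new Transform();
    private final Deque<Transform> transformStack = new ArrayDeque<>();

    /** Saves a copy of the current transformation. */
    void pushTransform() {
        transformStack.push(transform.copy());
    }

    /** Restores the most recently saved transformation. */
    void popTransform() {
        transform = transformStack.pop();
    }
}
```

A script that wraps its changes between a push and a pop leaves the state exactly as it found it, which is
what allows objects written by different authors to be rendered one after another without interfering.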

The difficult work done by a Compositor is rendering audiovisual objects. The Compositor has methods
for rendering all pre-defined (i.e., standardized) audiovisual objects, and also has a method for rendering all
encoder-defined (i.e., downloaded) audiovisual objects.

When a method is called for rendering a pre-defined, static visual object, such as an image or a polygon, the
following is performed. The Compositor performs the 2.5D or 3D spatial coordinate transformation specified
by the current coordinate transformation matrix, thereby positioning the object in the scene. (The temporal
portion of the coordinate transformation matrix is ignored, because the object is static or time-invariant.)
Then for each camera, the Compositor

- clips those portions of the object outside the camera's clipping volume,

- projects the object onto the camera's projection plane,

- resamples the object to match the target pixels of the projection plane,

- computes the color and transparency of each object pixel, which may vary according to the object's
orientation, surface normals, and lighting conditions, and finally

- blends each object pixel with its corresponding target pixel, according to the source and target blending
factors, if the object pixel is not hidden by a previously rendered, opaque object.

Of course, much of this procedure is simplified in the 2.5D case (e.g., lighting and surface normals are
ignored).

When a method is called for rendering a pre-defined, time-varying visual object, such as a video object,
steps identical to the above are performed. First, however, the time-varying object is reduced to a static
object. This is done by temporally sampling, or 'slicing', the time-varying 2.5D+T or 3D+T object at some
temporal offset in its local coordinate system, to produce a static 2.5D or 3D object. The local temporal
offset t at which the object is sliced corresponds to global time $t' = 0$.

The rationale for slicing time-varying audiovisual objects at global time $t' = 0$ is the following. The global
scene consists of one or more time-varying audiovisual objects. The temporal coordinate system of each
object has been stretched and delayed, in general by an affine coordinate transformation

$$t' = A_{tt}\, t + B_t$$

that maps the object's local temporal coordinate system into the scene's global temporal coordinate system.
Fig. 7 shows a local coordinate system that has been stretched (slowed down) by a factor of $A_{tt} = 2$, and
delayed by $B_t = 4$ seconds, relative to the global coordinate system.

A mechanism beyond the scope of this API generates video frames at a presentation frame rate.
Conceptually, each frame represents the scene, temporally sliced at advancing global time instants
$t' = 0.0, 0.1, 0.2, \ldots$, if the presentation frame rate is 10 frames per second, for example. By convention,
instead of slicing the scene at advancing global time instants $t' = 0.0, 0.1, 0.2, \ldots$, all the local coordinate
systems in the scene are advanced (temporally shifted to the left) by $0.0, 0.1, 0.2, \ldots$, and then the scene is
sliced at global time $t' = 0$. In this way, the local time at which a visual object should be sliced can be
determined by inspecting the temporal components of the current coordinate transformation matrix, and
solving the equation

$$0 = A_{tt}\, t + B_t$$

for the local time.

Fig. 7. Global and local coordinate systems.

In summary, when a Compositor method is called for rendering a time-varying visual object, the 2.5D+T
or 3D+T object is first temporally sliced at local time $t = -B_t / A_{tt}$, to produce a 2.5D or 3D object, which
can be subsequently rendered as a static object. Static objects are considered to be 2.5D+T or 3D+T
objects, which happen to be constant for all time.
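
The slicing rule is easy to check numerically. The sketch below assumes the affine time map is given by
two doubles, a (the time scale) and b (the delay in seconds); the class and method names are hypothetical.
Advancing the local coordinate system by a presentation time tau is modelled as replacing b with b - tau.

```java
/** Sketch of temporal slicing under the affine map: global time = a * local time + b. */
final class TemporalSlice {
    /** Local time that corresponds to global time 0, i.e. the solution of 0 = a * t + b. */
    static double localTimeAtGlobalZero(double a, double b) {
        if (a == 0.0) {
            throw new IllegalArgumentException("time scale must be non-zero");
        }
        return -b / a;
    }

    /** Local slice time for the frame presented at global time tau (seconds). */
    static double localTimeForFrame(double a, double b, double tau) {
        // Shifting the local system left by tau turns the delay b into b - tau.
        return localTimeAtGlobalZero(a, b - tau);
    }
}
```

For the object of Fig. 7 (scale 2, delay 4 seconds), the frame presented at global time 6 seconds slices
the object at local time 1 second: the object started 2 seconds earlier and plays at half speed.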

Audio objects are time-varying by nature. When a Compositor method is called for rendering an audio
object, the audio object is similarly temporally sliced, but the slices have width. To be precise, the portion of
the audio object rendered onto the audio frame(s) is given by the interval $[t, t + T)$ in the object's local
coordinate system, where $t$ and $T$ satisfy

$$0 = A_{tt}\, t + B_t, \qquad T' = A_{tt}\, T,$$

where $T'$ is the instantaneous interval between presentation frames, in the global coordinate system. The
Compositor 'renders' other pre-defined objects (such as cameras, microphones, and light sources) differently,
depending on the object. 'Rendering' such objects usually means registering them in an appropriate way with
the Compositor, using the current coordinate transformation matrix to determine the pose and directional
characteristics of the object. For example, a camera object defines a projection plane and a clipping volume
in its local coordinate system; likewise, a light source defines a radiation pattern in its local coordinate
system. When these objects are 'rendered', the current coordinate transformation matrix determines the
direction, scale, and translation of these objects.

Key to the flexible composition approach are encoder-defined audiovisual objects. An encoder can define
new classes of audiovisual objects, by downloading the new class definitions in the programming language.
All encoder-defined classes of audiovisual objects (as well as all pre-defined classes of audiovisual objects)
must by definition be derived from an abstract base class called AVObject, provided by the API. The
AVObject base class defines a generic public interface for audiovisual objects of all kinds. Since all
audiovisual objects must be derived from this base class, all audiovisual objects must provide the minimal set
of methods specified by the base class. Two methods in particular are defined by the base class, and hence
must be provided by all encoder-defined (and pre-defined) audiovisual objects:
- render, which takes a Compositor as an argument, and
- handle, which takes an event structure as an argument.
Of course, audiovisual objects may also provide additional, class-specific methods.

The purpose of an audiovisual object's render method is to render the object onto the specified
compositor. In an encoder-defined audiovisual object for which the class definition is downloaded in the
programming language, the body of the routine constitutes a script that describes step by step how to render
the object onto the compositor. (The script is executed once per presentation frame by a mechanism outside
the scope of the API.)

The script may contain an arbitrary sequence of instructions, but at the very least it contains instructions
to the Compositor to render the object's audiovisual sub-objects in appropriate poses. The script may also
contain instructions to save and restore the current state, instructions to compute the local time from the
current coordinate transformation, instructions to obtain decoded information from the appropriate
decoders, instructions to read input, instructions to set and test state variables, branch conditionally, and so forth.

The purpose of the audiovisual object's handle method is to handle events, whether synchronously
generated by a script, or asynchronously generated by user input, for example. In an encoder-defined
audiovisual object, for which the class definition is downloaded in the programming language, the body of
the routine constitutes a script that describes step by step how to handle events. (The script is executed once
per asynchronous event by a mechanism outside the scope of the API.) The script may contain an arbitrary
sequence of instructions, but at the very least it contains instructions to examine the information in the event
structure, and to synchronously pass the event to the event handlers of the object's sub-objects. The script
may also contain instructions to modify state information in the object. This state information can be tested
in the render method to influence the way the object is rendered.

Because rendering an encoder-defined audiovisual object involves rendering sub-objects, which can
in turn be encoder-defined audiovisual objects, audiovisual objects are in general hierarchical. The
hierarchy constitutes a scene graph. The scene graph is dynamic because any audiovisual object
can conditionally render its sub-objects, so the scene may change from frame to frame. At the root of the
scene graph is the scene. Thus the scene is itself an audiovisual object. This has implications for bitstream
editing. Two audiovisual objects can be authored in different institutions, and can individually play on any
MPEG-4 decoder. Then a new scene class can be defined, with the two audiovisual objects as sub-objects.
The new scene can again be played on any MPEG-4 decoder. 'Plug-and-play' of audiovisual objects is
guaranteed by virtue of their standard interface, enforced by derivation from the AVObject class. Temporal
synchronization between audiovisual objects is guaranteed by the temporal coordinate system model.

When an instruction is sent to the Compositor to render an encoder-defined audiovisual object, the
Compositor does not know intrinsically how to render encoder-defined audiovisual objects, but
encoder-defined audiovisual objects know how to render themselves. So when the render method of the Compositor
is called upon to render an encoder-defined audiovisual object, it turns around and calls the object's render
method, with itself as the argument. It is not necessary to register encoder-defined audiovisual objects with
the Compositor for it to know how to render them: the Compositor makes use only of the generic properties
of audiovisual objects. The ability to deal with a variety of different types of objects behind a single interface
is called polymorphism in object-oriented terminology.
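
The 'turn around' call described above can be sketched in a few lines. The names AVObject, Compositor,
and render follow the text; everything else (drawPrimitive, the log list, and the Logo and Scene classes)
is hypothetical, the handle method is omitted for brevity, and the drawing itself is replaced by logging.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal stand-in for the abstract base class; handle() is omitted in this sketch. */
abstract class AVObject {
    abstract void render(Compositor c);
}

/** Toy compositor: it records drawing requests instead of producing pixels. */
final class Compositor {
    final List<String> log = new ArrayList<>();

    /** Rendering an encoder-defined object means asking the object to render itself. */
    void render(AVObject obj) {
        obj.render(this);
    }

    /** Hypothetical stand-in for rendering a pre-defined primitive. */
    void drawPrimitive(String name) {
        log.add(name);
    }
}

/** Leaf object that draws a single primitive. */
final class Logo extends AVObject {
    void render(Compositor c) {
        c.drawPrimitive("logo");
    }
}

/** Composite object: rendering it renders its sub-objects, forming a scene graph. */
final class Scene extends AVObject {
    private final List<AVObject> children = new ArrayList<>();

    Scene add(AVObject child) {
        children.add(child);
        return this;
    }

    void render(Compositor c) {
        for (AVObject child : children) {
            c.render(child);
        }
    }
}
```

The compositor never inspects the concrete class of the object it is given, so a Scene nested inside
another Scene is rendered by exactly the same call as a single Logo.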

Although 'the' Compositor has been described until this point, the Compositor module is actually a class,
or template, for compositor objects. Thus in principle many compositor objects may be instantiated
simultaneously within a decoder. Although one of the compositor objects must be connected to the output
frame buffers for each audio and video channel, the other compositor objects may be constructed with
arbitrary numbers of audio and video frames. For example, a compositor object may be constructed with one
video frame and no audio frames.

It is an important architectural principle that a compositor object's output video and audio frames are
themselves primitive audiovisual objects, which are therefore directly renderable onto another compositor
object's frames. In this way, compositor objects can be networked together, with some compositor objects
producing intermediate results for others. For example, one compositor object may be set up to compute
3D projections, while another compositor object may be set up to compute 2D overlay planes. A final
compositor object may simply blend the overlay planes with the 3D projections. Such configurations can be
set up in the render method of the downloaded scene.

3.4. Examples

The following scripts are working examples of encoder-defined AV objects from a current implementation
of the API. It should be noted that these are only illustrative examples. The API is subject to change in future
revisions of MSDL; furthermore, the syntactic decoding operations are treated here differently from what is
presented in Section 5 in order to make the examples self-contained.

3.4.1. ReversibleVideoObject

This first example shows a video playing in the center of the display, rotating counterclockwise at 90°/s.
When the user points to the object and clicks, the object reverses direction.

This effect is achieved by overriding the handle method. When a user event is passed to the object, the
handle method checks the event structure to see if the label on the object matches the object's identity. If so, it
stores the time and the angle of rotation when the click occurred. It also reverses the sign of the rotation speed.
The render method computes the local time, and computes the angle of rotation since the last click. It
then rotates the object by the computed angle, labels the object with its identifier, and renders the object. The
current state is saved and restored.

public class ReversibleVideoObject extends AVObject {
    VideoObject videoObject = new VideoObject();
    byte ident = 17;
    double time, angle;
    double timeAtLastClick = 0;
    double angleAtLastClick = 0;
    double degsPerSec = 90.0;

    public void render(Compositor c) {
        // Angle accumulated since the last click, at the current rotation speed.
        time = c.transform.localTime();
        angle = (time - timeAtLastClick) * degsPerSec + angleAtLastClick;
        // Save the state, pose and label the video, render it, restore the state.
        c.pushTransform();
        c.pushProperties();
        c.transform.rotate(angle);
        c.properties.label(ident);
        c.render(videoObject);
        c.popProperties();
        c.popTransform();
    }

    public void handle(Event e) {
        // React only to clicks on this object; remember where the rotation was.
        if (((PressEvent) e).label == ident) {
            timeAtLastClick = time;
            angleAtLastClick = angle;
            degsPerSec *= -1;
        }
    }

    public ReversibleVideoObject(int id) { ident = (byte) id; }
    public ReversibleVideoObject() { }
}

3.4.2. DoubleReversibleVideoObject

This second example shows two ReversibleVideoObjects playing side by side. One object rotates and plays
at 0.25 the speed of the other. The handle method passes received events to each sub-object, without
interpretation. The two objects are labeled differently, so each will reverse itself independently.

public class DoubleReversibleVideoObject extends AVObject {
    AVObject obj1 = new ReversibleVideoObject(11);
    AVObject obj2 = new ReversibleVideoObject(23);

    public void render(Compositor c) {
        c.pushTransform();
        c.transform.translate(-60, 0);
        c.render(obj1);
        // The second object plays and rotates at a quarter of the speed.
        c.transform.speed(0.25);
        c.transform.translate(160, 0);
        c.render(obj2);
        c.popTransform();
    }

    public void handle(mpeg4.Event e) {
        // Pass every event on to both sub-objects, without interpretation.
        obj1.handle(e);
        obj2.handle(e);
    }
}



3.4.3. AnimatedVideoObject

This final example shows a video object whose DPCM-encoded (x, y) position is parsed and decoded from
the input stream. For conceptual simplicity, the render method here makes use of a private decode
method to decode the (x, y) position. The syntactic description can be integrated with the class declaration, as
described in Section 5.

public class AnimatedVideoObject extends AVObject {
    VideoObject videoObject = new VideoObject();
    int x = 0;
    int y = 0;

    public void render(Compositor c) {
        decode(c.inputStream);
        c.pushTransform();
        c.transform.translate(x, y);
        c.render(videoObject);
        c.popTransform();
    }

    void decode(InputStream is) {
        // DPCM: each decoded value is a difference added to the previous position.
        if (!is.eos()) {
            x += is.int();
            y += is.int();
        }
    }
}

3.5. Future work

A large amount of work remains to be done in the area of MPEG-4 composition. In particular:

- Harmonization with VRML, the Java Media API, etc. A standard should unify the marketplace, not
splinter it. None of the existing standards address all the needs of MPEG-4, but there is substantial overlap
with some of them. Work needs to be done to harmonize MPEG-4 with existing standards as much as
possible.

- Adding higher level constructs to the API. The composition API is low-level, much closer to OpenGL than
to VRML, for example. This permits the maximum amount of flexibility to script writers. However,
authoring scripts is difficult, and moreover, different authors will build different higher-level functionalities
(e.g. layout management) into their audiovisual objects. In that case the audiovisual objects will only work
together at a very low level. Work needs to be done to standardize higher level mechanisms in the API.

- Specifying more primitives. Work needs to be done to define standard audiovisual objects and
other primitives, such as text fonts, synthetic audio, and other objects that script authors should be able to
count on.

- Refining the audio rendering. To date, audio rendering has not been implemented within the context of the
composition API. The design needs to be implemented and refined.

- Refining the non-flexible composition syntax for different Systems profiles.



4. Decompression - reconstruction of AV objects

4.1. Key concepts and requirements

In previous coding standards, such as MPEG-1, MPEG-2, or H.263, a fixed set of algorithms was specified
and used to decode synchronized video and audio frames. MPEG-4 intends to cover a much larger scope: it
specifies some flexible and configurable decompression algorithms for various types of audiovisual data, not
restricted to audio or video frames. Typically, an MPEG-4 scene may consist of a movie coded with
MPEG-1, a 3D logo coded with a specific encoder-defined algorithm, and a sound coded with a modified
version of G.723. In the following sections, the design of a decoder enabling such functionalities will be
detailed.

To achieve the above-mentioned goals, an object-oriented design of the decoder has been adopted.
The key concept of this design, as described in Section 2.3, is to separate the audiovisual objects that
represent the data to be manipulated in the application (video frames, 3D graphic objects, etc.) from the
decompression processes. The decoder will thus rely on a set of classes (in the object-oriented sense) called the
standard class library. This library will provide the necessary interfaces for decompressing a pre-defined set of
classes of audiovisual data with a pre-defined set of tools. Moreover, the design makes it possible to define
and download new audiovisual primitives or decompression tools, and to configure tools into new
algorithms.

In summary, an MPEG-4 terminal provides the following capabilities regarding decompression:
- the use of pre-defined algorithms, such as MPEG-1 video or audio, MPEG-2 video profiles or MPEG-4
audio or video profiles,
- the configuration of existing tools into new algorithms,
- the definition of new tools and algorithms, and
- the exchange of various messages with the multiplexer and the compositor.
These capabilities are eventually limited by the flexibility supported by the terminal. These limitations will be
described in the next sections.

4.2. The standard class library for decompression

The standard class library contains the basic AV objects and process objects needed by a non-flexible
application. It also contains all the configurable tools that will be used in a flexible application (see Section 2).
In this section, a subset of the standard class library, as it is currently defined, is described. The standard class
library is evolving according to the verification model specifications for audio and video coding (both natural
and synthetic) in MPEG-4. Each audio and video profile of MPEG-4 may use different subsets of the
standard class library.

As mentioned in the previous section, the standard class library for decompression contains two kinds of
classes:

- AVObjects, which are the elementary audiovisual components that need to be manipulated in the
application, and

- ProcessObjects, which are the decompression tools used by AV objects to decode themselves.

As discussed in Section 3, AV objects define two methods in particular: the render and handle methods. In
addition, many of the AV object classes in the standard class library define specific decode methods. The
decode method of such a standard AV object decodes the attributes of the object itself: it builds and
instantiates, from a coded representation, all the attributes of the AV object. The method is typically
implemented using high level process objects, but it may be overridden in a flexible environment. For
instance, the decode method of a 3D model may be overridden to just read a VRML description, or the
decode method of an image may be overridden to use a newly defined decompression algorithm.

Process objects perform processing on AV objects by using an apply method. This apply method
changes or instantiates the attributes of the AV object being processed.

The MPEG-4 Systems is designed in this way to emphasize the independence of AV objects and the way
they have been coded in a particular application. Moreover, designing the architecture in this way promotes
reusability of the tools in various contexts. For instance, a DCT coding tool may be used by MPEG-1/2,
H.263, JPEG and by any downloaded decompression method.
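
The separation between AV objects and process objects can be illustrated with a minimal Java sketch. All names below (ProcessTool, ScaleTool, SketchAVObject) are hypothetical and are not part of the standard class library; only the idea of an apply method that changes the attributes of the object being processed, and of a decode method that delegates to a tool, is taken from the text. The "tool" here simply doubles each sample and merely stands in for a real decompression tool such as a DCT.

```java
// Hypothetical interface: a process object exposes a single apply method.
interface ProcessTool {
    // Changes the attributes (here: the samples) of the object being processed.
    void apply(int[] samples);
}

// Stand-in tool that doubles every sample. It is NOT a real DCT.
final class ScaleTool implements ProcessTool {
    @Override
    public void apply(int[] samples) {
        for (int i = 0; i < samples.length; i++) {
            samples[i] *= 2;
        }
    }
}

// An AV object owns its attributes and delegates decoding to a tool.
final class SketchAVObject {
    final int[] samples;
    private final ProcessTool tool;

    SketchAVObject(int[] coded, ProcessTool tool) {
        this.samples = coded.clone();
        this.tool = tool;
    }

    void decode() {
        tool.apply(samples);
    }
}

final class ProcessToolDemo {
    public static void main(String[] args) {
        // The same tool instance is reused by two unrelated AV objects.
        ProcessTool shared = new ScaleTool();
        SketchAVObject image = new SketchAVObject(new int[] {1, 2, 3}, shared);
        SketchAVObject audio = new SketchAVObject(new int[] {10}, shared);
        image.decode();
        audio.decode();
        System.out.println(image.samples[2] + " " + audio.samples[0]);
    }
}
```

Because the AV object only depends on the interface, the same tool can serve several kinds of objects, which is the reusability argument made above.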

AV Objects in the class library are likely to include video objects, images, audio waveforms, and 3D face
models, for example. Fig. 8 shows an OMT representation of some AV object classes.
Process Objects in the standard class library are organized in two sections:

- high level objects, such as MPEG-4 profiles and MPEG-2 profiles, exemplified for video in Fig. 9, and

- low level decompression tools such as quantizers, exemplified for video in Fig. 10.

4.3. The non-flexible approach to decompression

In this section, we deal with decompression for a non-flexible terminal. This corresponds to a finite set of
standardized audio, video, and system algorithms made up of standardized tools. In this section, the concept
of the standard class library is used to enable a consistent description of decoders for all levels of flexibility.
However, it should be understood that non-flexible terminals do not require the availability of the class
interfaces in any specific object-oriented language. The object-oriented formalism is used here for
consistency of description between the non-flexible and the flexible approach, just as pseudo-C code is used
in MPEG-2 for describing its syntax. Bearing that in mind, relating a non-flexible application to the
standard class library implies that:

- only classes of AV objects in the standard class library are used,

- each of these AV objects has a defined standard decompression method, and

- only decompression tools in the standard process objects library are used.

Let us now take the example of a video object in a multi-standard non-flexible application in which
standard video decompression algorithms can be used. In that case, the VideoObject class of the standard
library will have a decode method that takes into account MPEG-2 and MPEG-4 video profiles, as well as
all the video object decoders listed in the standard class library. A pre-defined switch could enable the choice
between these algorithms. The standard decode method of the video object may look like:

class VideoObject extends AVObject {
    int algorithmType;
    ProcessObject coder = null;
    Image image = new Image();

    public void render(Compositor c) {
        decode(c);
        c.render(image);
    }

    protected void decode(Compositor c) {
        if (coder == null) {
            algorithmType = c.inputstream.uint(9);
            switch (algorithmType) {
            case 0:
                // Construction of an MPEG2-High Process Object,
                // reading data from c.inputstream
                coder = new MPEG2High(c.inputstream);
                break;
            ...
            case xxx:
                coder = new MPEG4ProfileN(c.inputstream);
                break;
            default:
                Error("Coder not supported in the standard library");
                break;
            }
        }
        // decode input stream into image at presentation time
        // c.transform.localTime()
        coder.apply(image, c.transform.localTime());
    }
}

Then, the only difference between an MPEG-4 decoder architecture and a classical MPEG-2 one is the
9-bit parameter in the header of the bitstream that specifies the standard algorithm to be used (note that the
syntactic decoding operations here are handled differently from what is presented in Section 5 to make the
examples self-contained).
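
The selection mechanism can be made concrete with a small, self-contained Java sketch. It assumes an MSB-first bit reader over a byte array (the BitReader class is a hypothetical helper, not part of the standard class library), and the code points 0 and 1 are invented for the illustration, since the text does not give the actual values of the 9-bit parameter. Instead of constructing process objects, the sketch only returns the name of the coder that would be selected.

```java
// Hypothetical helper: minimal MSB-first bit reader over a byte array.
final class BitReader {
    private final byte[] data;
    private int bitPos = 0;

    BitReader(byte[] data) {
        this.data = data;
    }

    // Reads n bits (0 <= n <= 31) as an unsigned integer, most significant bit first.
    int uint(int n) {
        int value = 0;
        for (int i = 0; i < n; i++) {
            int bit = (data[bitPos >> 3] >> (7 - (bitPos & 7))) & 1;
            value = (value << 1) | bit;
            bitPos++;
        }
        return value;
    }
}

final class AlgorithmSelector {
    // Reads the 9-bit algorithm type and maps it to a coder name.
    // The code points 0 and 1 are assumptions made for this sketch only.
    static String select(BitReader in) {
        int algorithmType = in.uint(9);
        switch (algorithmType) {
            case 0:
                return "MPEG2High";
            case 1:
                return "MPEG4ProfileN";
            default:
                throw new IllegalArgumentException(
                        "Coder not supported in the standard library: " + algorithmType);
        }
    }

    public static void main(String[] args) {
        // Bits 000000001 followed by padding, i.e. the bytes 0x00 0x80.
        BitReader in = new BitReader(new byte[] {0x00, (byte) 0x80});
        System.out.println(select(in));
    }
}
```

An unknown code point is rejected here with an exception, which mirrors the default branch of the pseudo code above.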

The non-flexible approach for decompression is thus very similar to the classical MPEG-2-like standards.
The algorithms are selected by a switch sent in the bitstream. The only difference is that a broader set of
algorithms is eventually available.

The non-flexible approach is useful for applications that are satisfied with the fixed standardized set of
profiles (standardized configuration of tools). However, any time an application needs a new decompression
algorithm (non-standardized configuration of tools) to satisfy its special needs, either a new round of
standardization is engaged, or the flexible approach is used.



Here we are using a Java-like description of the code. For purposes of clarity and conciseness, we only provide
pseudo code in this example, as well as in the following ones concerning decoding.

4.4. The flexible approach to decompression

4.4.1. Configuration of existing tools into new algorithms

In this section, we consider that a finite set of standardized audio, video, and system tools and their
standardized interfaces are available at the terminal. These tools may be flexibly configured into arbitrary
algorithms. In terms of the standard class library this means the following:

- the standard class library can be used as in a non-flexible application,

- decompression methods for standard AV objects can be overridden by configurations of standard tools,
and

- standard decompression tools can be used for non-standard (encoder-defined) AV objects.

In order to fulfill the above requirements, the flexible terminal must have the ability to download new classes.
Let us now give some concrete examples to illustrate the various possibilities for using the flexible approach
to decompression in an MPEG-4 decoder.

In the first example, let us consider a flexible application in which a video object is decoded using the
standard decompression method of the non-flexible application given in the previous section. Imagine that,
due to the low bit-rate used, post-filtering needs to be performed in order to reduce the blocking artifacts. To
do so in the flexible approach, the application first transmits the definition of a new class, say the
MyVideoObject class, which simply inherits from the standard VideoObject class and overrides its decode
method in the following way:

class MyVideoObject extends VideoObject {
    protected void decode(InputStream is) {
        VideoObject.decode(is);     // call the decode method of the parent class
        PostFilter.apply(this);     // then post-filter the decoded video object
    }
}

Note here that it is not necessary to re-define the render method of the class, since it is exactly the same as
the one of its parent class. This is an example where the independence of the rendering and the decompression
processes is used. Moreover, the standard method for decompressing a video object for the particular profile
used did not need to be redefined, since the decode method of the parent class was called.
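
The same pattern can be written as runnable Java. In the sketch below, BaseVideo and FilteredVideo are hypothetical stand-ins for VideoObject and MyVideoObject, the "decoder" merely copies the coded samples, and the post-filter is a simple 3-tap averaging filter chosen for the illustration (it is not an actual deblocking filter). The point is only that the subclass overrides decode, calls the parent's decode, and inherits render unchanged.

```java
// Stand-in for the standard video object.
class BaseVideo {
    int[] pixels = new int[0];
    int renderCount = 0;

    // Stand-in for the standard decompression method: copies the coded samples.
    protected void decode(int[] coded) {
        pixels = coded.clone();
    }

    // Rendering always decodes first; subclasses inherit this method as is.
    public void render(int[] coded) {
        decode(coded);
        renderCount++;
    }
}

// Stand-in for the downloaded class that adds post-filtering.
class FilteredVideo extends BaseVideo {
    @Override
    protected void decode(int[] coded) {
        super.decode(coded);   // reuse the parent's decompression
        smooth(pixels);        // then post-filter the result
    }

    // Hypothetical post-filter: replaces each interior sample by the integer
    // mean of itself and its two neighbours; border samples are left untouched.
    static void smooth(int[] p) {
        int[] src = p.clone();
        for (int i = 1; i < p.length - 1; i++) {
            p[i] = (src[i - 1] + src[i] + src[i + 1]) / 3;
        }
    }
}

final class PostFilterDemo {
    public static void main(String[] args) {
        FilteredVideo v = new FilteredVideo();
        v.render(new int[] {0, 9, 0, 9, 0});
        System.out.println(java.util.Arrays.toString(v.pixels));
    }
}
```

In real Java the parent's method is reached through super.decode, which plays the role of the VideoObject.decode call in the pseudo code.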

The scene (top-most AV Object in the application) then makes use of the MyVideoObject class.
A typical render method (see Section 3 for more detail) of a scene displaying two videos, one filling the whole
screen (using this new decoding process) and one in the lower left corner of the screen (using the standard
decompression process), will look like:

class MyScene extends AVObject {
    MyVideoObject video1;
    VideoObject video2;

    public void render(Compositor c) {
        c.pushTransform();
        c.render(video1);           // equivalent to video1.render(c);
        c.transform.scale(0.25);
        c.render(video2);           // equivalent to video2.render(c);
        c.popTransform();
    }
}

In this example, the MyVideoObject class definition needs to be sent before the data. No additional
overhead is needed in the bit stream.

A second context for that kind of flexible application is the case in which a new AV object class is created,
and its decode method just makes use of existing tools in the standard library.

Let us consider a simple example of a two-dimensional polygon object, for which vertices are coded two by
two, using a simple LBG quantizer in four dimensions, with a fixed codebook. The class definition with the
associated decode method may look like:

class 2DPolygon extends AVObject {
    ListOfPoint vertices = new ListOfPoint();
    Codebook myDictionary = new Codebook(x1, y1, x2, y2, ...);
    // Defines the fixed codebook for this particular example
    Quantizer quantizer = new LBGQuantizer(myDictionary, ...);
    // Builds an LBG quantizer with a given dictionary and relevant parameters
    EventSource events = new EventSource(4);

    public void render(Compositor c) {
        decode(c.inputstream);
        c.render(vertices);
    }

    protected void decode(InputStream is) {
        int numberOfVertexPairs = is.uint(5);   // read number of vertex pairs
        while (numberOfVertexPairs-- > 0) {
            quantizer.apply(events, is);
            // The apply method of the LBG quantizer reads the quantized vectors
            // from the inputstream and puts the result in the event object
            vertices.InsertPoint(events.elem(0), events.elem(1));
            vertices.InsertPoint(events.elem(2), events.elem(3));
        }
    }
}

In the above example, an alternative possibility would be to define a new process object, PolygonDecoder,
and make an explicit call to this process object in the decode method of the 2DPolygon object.
This would be done if the PolygonDecoder process object had to be re-used in another context in the
application. Note also in the above example that the class definition is using four classes of the standard class
library, namely the LBGQuantizer, ListOfPoint, EventSource and Codebook classes. The main
advantages of using the standard class library here are the following:

- the overhead of sending class definitions in order to run the application at the terminal is reduced, and

- the implementation of the standard class library is optimized for the terminal and is usually more efficient
computationally than a downloaded version of the same library. For this reason, only the interfaces are
standardized.

Concerning the terminal, the following elements are required:

- a locally implemented and optimized version of the class library, and

- the ability to download new classes and link them with the standard class library.

The standard class library will be called any time a computationally expensive tool is required. If the decoder
is powerful enough to download full tool definitions or algorithms, decompression may be reconfigured as
described in the following section.

This kind of flexible terminal enables downloading complex audiovisual scene definitions, with specific
treatment of interactivity (see Section 3 on composition), and defining new AV objects with associated
decompression methods.

4.4.2. The definition of new tools and algorithms

We describe in this section a possible extension of the flexible approach defined in MPEG-4. We consider
a terminal with a standardized mechanism to describe arbitrary algorithms made of arbitrary tools. From the
terminal point of view, the following capabilities are required:

- the standard library can be used and extended as in the previous flexible environment, and
- new process objects can be downloaded and used efficiently to enhance the library of tools.

Compared to the capabilities detailed in the previous section, the terminals now have the ability to download
and efficiently use decompression tools. From an architecture point of view, there is no major difference with
the previous approach. However, a practical implementation of this approach that will ensure satisfactory
interworking between heterogeneous terminals is not yet foreseen.

Taking the 2DPolygon example of Section 4.4.1, an MPEG-4 content provider may like to use, for coding
efficiency reasons, its own quantizing tool to code 2DPolygon vertices. Let us take the example of a D4
lattice vector quantizer tool which has not been included in the standard class library. Two classes need to be
transmitted:

class D4LatticeVectorQuantizer extends VectorQuantizer {
    public void apply(EventSource events, InputStream is) {
        int rho = is.int(8);
        int theta = is.int(8);
        // rho and theta are the result of the D4 lattice VQ,
        // here the input parameters of the inverse quantization ...
        // The code of the D4 lattice VQ inverse quantization procedure ...
    }
}

Then, the new 2DPolygon class using this new tool in the decompression procedure is sent:

class 2DPolygon extends AVObject {
    ListOfPoint vertices = new ListOfPoint();
    Quantizer quantizer = new D4LatticeVectorQuantizer();
    EventSource events = new EventSource(4);

    public void render(Compositor c) {
        decode(c.inputstream);
        c.render(vertices);
    }

    protected void decode(InputStream is) {
        int numberOfVertexPairs = is.uint(5);   // read number of vertex pairs
        while (numberOfVertexPairs-- > 0) {
            quantizer.apply(events, is);
            vertices.InsertPoint(events.elem(0), events.elem(1));
            vertices.InsertPoint(events.elem(2), events.elem(3));
        }
    }
}

Note here that the inverse lattice vector quantization tool may be used in other contexts in the same
application, to decode other kinds of AV objects. This is the reason to download it as a separate process
object. Note also that polymorphism has been explicitly used in this example. The LBGQuantizer and the
D4LatticeVectorQuantizer are both derived from the abstract Quantizer class. They all use
an apply method with the same signature. The advantage of doing so is that the code for the decode
method of the two polygons is exactly the same. Only the declaration of the quantizer has changed.
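
The polymorphism argument can be checked with a small runnable Java sketch. All class names are hypothetical, the quantizer indices are passed in as a plain array instead of being read from a bitstream, and neither quantizer implements the real algorithm: TableQuantizer is a plain codebook lookup in the spirit of a fixed-codebook quantizer, and RuleQuantizer uses an arbitrary rule that is explicitly not a D4 lattice quantizer. What the sketch does show is that the polygon's decode method is written once against the abstract class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstract tool: turns one index into four reconstructed values.
abstract class InverseQuantizer {
    abstract void apply(int[] out, int index);
}

// Codebook lookup, in the spirit of a fixed-codebook quantizer.
final class TableQuantizer extends InverseQuantizer {
    private final int[][] codebook;

    TableQuantizer(int[][] codebook) {
        this.codebook = codebook;
    }

    @Override
    void apply(int[] out, int index) {
        System.arraycopy(codebook[index], 0, out, 0, 4);
    }
}

// Purely illustrative rule; this is NOT a D4 lattice quantizer.
final class RuleQuantizer extends InverseQuantizer {
    @Override
    void apply(int[] out, int index) {
        for (int k = 0; k < 4; k++) {
            out[k] = index * (k + 1);
        }
    }
}

final class PolygonSketch {
    final List<int[]> vertices = new ArrayList<>();
    private final InverseQuantizer quantizer;

    PolygonSketch(InverseQuantizer quantizer) {
        this.quantizer = quantizer;
    }

    // Identical for every quantizer: each index yields two (x, y) vertices.
    void decode(int[] indices) {
        int[] events = new int[4];
        for (int index : indices) {
            quantizer.apply(events, index);
            vertices.add(new int[] {events[0], events[1]});
            vertices.add(new int[] {events[2], events[3]});
        }
    }

    public static void main(String[] args) {
        PolygonSketch p = new PolygonSketch(new RuleQuantizer());
        p.decode(new int[] {2});
        System.out.println(p.vertices.size() + " vertices decoded");
    }
}
```

Swapping the quantizer only changes the constructor argument, which corresponds to changing the declaration of the quantizer in the two class definitions above.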



4.5. Future work

Concerning decompression, the non-flexible approach is feasible today almost without changing the
current architectures of audiovisual terminals such as MPEG-2 decoders. But the impact in terms of
functionalities and applications is limited, since only predefined decompression algorithms may be used, for
predefined audiovisual objects.

Flexible terminals provide for the ability to download descriptions of new classes of data (AV objects), with
associated decompression methods, using locally optimized decompression tools. Such flexible terminals
enable content providers to adapt the decompression processes to the specific data they are dealing with, by
reconfiguring the algorithms with standardized tools. Depending on the technology and market needs, the
flexible terminals may be extended to allow downloading of tools.

Flexibility enables manipulation of much richer content than in previous standards. However, to achieve
this, much work remains to be done to completely specify and validate the decompression interfaces for
natural and synthetic audiovisual objects.

5. Syntactic decoding - interpreting the bits

5.1. Rationale

The Syntactic Description Language (SDL)* part of MSDL addresses the need to disengage the definition
of the bitstream syntax of MPEG-4 content from the decoding and rendering tools. This requirement
originates from the fact that a given syntax specification may be decoded using different implementations of
the relevant algorithms. This certainly has been the central theme in the MPEG-1 and MPEG-2 series of
specifications, even though the syntax specification utilized both formal and non-formal techniques (i.e., it
included explanatory text without which the definition of the syntax would be incomplete).

The separation of bitstream parsing from the other decoding and rendering steps provides content
developers with the capability to create customized bitstream structures to suit their specific application
needs. For example, it allows content developers to modify the size of parameters present in the bitstream
without the need to perform any modification in the decoding tools. It also allows the introduction of new
bitstream parameters without losing backwards compatibility with older tools: the new parameters will be
ignored by older tools, but they can be used by their newer versions.

This separation also promotes an open approach in terms of bitstream definition, in which content
developers may wish to publish their low-level syntax but not their proprietary processing algorithms. This is
especially important for flexible systems, and is in line with the implementation-independent approach of
MPEG-1 and MPEG-2. Terminals that are not equipped with the appropriate decoding tool will just parse
and discard any corresponding data. Alternatively, different implementations of the decoding tools may be
used, developed by a third party.



*SDL is unrelated to 'The Specification and Description Language' (ITU-T Z.100), which is used in the
telecommunications field.

An additional important benefit of separating bitstream parsing from processing is that the task of the
bitstream architect (or content developer) is greatly simplified. Focus is drawn on the important tasks of
decoding information and preparing it for presentation, and not on the mundane task of obtaining bits from
a bitstream. This is an underlying theme in all typed programming languages, where a set of standard types
(chars, ints, doubles) are directly provided by the language, without requiring direct manipulation of their
representation by the programmer. The language compiler or interpreter is responsible for ensuring that data
manipulations (including conversions) are consistent with type declarations or their derivatives (extended
types). These languages, of course, do not take into account that the data may be obtained from a bitstream,
but the same concept applies.

Introducing a formal mechanism to specify bitstream syntax also has significant benefits in terms of the
accuracy of the specification. Taking ASN.1 [8, 9] as an example, the specification of a binary encoded
structure is performed via a formal language. The binary representation is unambiguously determined by
a set of encoding rules [9]. Unfortunately, ASN.1 cannot be used within the context of MPEG-4 because of
two drawbacks: (1) it was not designed to address the intricate bitstream structures produced by
sophisticated source coding, but rather much simpler packet-like structures; (2) it cannot be integrated with
a programming language. The latter is necessary in MPEG-4 as, in addition to the bitstream syntax, the
standard and content developer will need to also specify a facility to peruse the defined decoding and
composition APIs. With a well-defined SDL, the exact text of the syntax specification can be used unchanged
to drive software tools that produce the entire parsing/bit generation skeleton of a decoder/encoder, as well
as automatically generate compliance testing tools.

Separation of the bitstream syntax from decoding provides for automatic compliance with the overall
bitstream architecture that MSDL will define (multiplexing level). The alternative approach would put the
burden of complying with bitstream-level object delineation and naming on the content developer, whereas
SDL can provide automated facilities that prohibit mistakes or warn about potential conflicts. An example is
the process of compilation, which produces a binary file with an appropriate header that identifies it to the
operating system as an executable (the 'magic' number).

From an implementation perspective, a formal declarative description of the bitstream syntax provides
the utmost flexibility. The description of the syntax should indicate what the bitstream contains, and
not how to obtain it. If the syntax is implicitly defined as a series of processing steps deeply embedded into
a decoding tool, that capability is completely lost. The declarative approach allows a very broad spectrum of
hardware and software implementations, and provides for market differentiation by allowing creative
competitiveness.

From a general architectural perspective, independent programmability of bitstream parsing is a natural
extension of the notion of programmable clients. In the software architecture field, the idea of downloading
platform independent code and locally executing it in a client is by now well entrenched after the
proliferation of Java. Such an architecture, however, assumes to a large extent that the executing program
exists isolated from the rest of the world. In the communications-oriented environment that MPEG-4 is
addressing, in addition to a downloadable executable program, there is also a notion of a continuously
transmitted stream that is received by the client for further processing. It is natural, then, to extend the
programmability aspects to the treatment of the received bitstream. As described later on, this is elegantly
integrated with the overall programming methodology of MPEG-4.

A potential drawback arising from the separation of parsing is that context information (information that
affects the parsing of the coming bits) has to be very close to the bitstream level. If a technique relies on
extensive decoding in order to obtain the value of a particular condition, then in order to parse the bitstream
extensive decoding has to take place. In such situations, the condition has to be expressed in the bitstream,
rather than be inferred from decoding. Examination of current state-of-the-art specifications does not reveal
instances where such high-level context is used. In addition, the potential overhead for adding bitstream-level
information in these cases should be expected to be rather small.








SDL has been designed so that it can describe existing audiovisual coding standards. SDL is also an
easy-to-read way of defining syntax specifications, since it is based on a set of well-defined elements with
unambiguous semantics. It has been successfully used to describe MPEG-2 Video profiles, while work is
underway so that all the MPEG-4 specifications, including the current Verification Models, are converted to
use SDL.



5.2. SDL and the MSDL environment

MSDL provides the overall programming environment in which MPEG-4 content is developed. In
addition to the class structures, or APIs, that are defined for decoding and compositing, it will also include
a programming facility through which these APIs will be exercised (Java is currently being used). SDL is
being designed as an orthogonal component of this language. Orthogonality here implies that the two are
independent: the specification of SDL does not affect the programming constructs and vice versa. There are,
of course, some common basic principles assumed, at the level of the capability to define data structures and
object hierarchies.

In its current form, SDL (in its textual version) assumes a C++/Java-like approach as the central
theme of the MSDL programming language. It then proceeds to extend the typing system by providing
facilities for defining bitstream-level quantities, and how they should be parsed. SDL can be seen as
generalizing the concept of declaring constants with hard-coded values to that of declaring constants that
obtain their values from a bitstream. Similarly to traditional constants, where a programmer is not concerned
with how the initialization is performed but needs to assume that it is performed before the variable is
accessed, an SDL programmer can assume that a constant is parsed from the bitstream before it is accessed.
As described later on, parameters are parsed only once, and hence the behavior is indeed very similar to
traditional constants.

The interface between SDL and the overall MSDL architecture is therefore well-defined. In addition, it
poses no restriction to the individual structures of SDL and other MSDL components. The rule of
guaranteeing that a variable is parsed before it is accessed is the most general one. Simpler rules could also be
adopted: an example would be to mandate that a variable is parsed when an object is instantiated.
Alternatively, parsing can be triggered by a specific API method, although this may significantly limit the
implementation flexibility of the decoder. The optimal methodology is currently investigated by
experimental implementation that can accurately expose the benefits and drawbacks of each approach. Note that
regardless of the method used, the SDL specification is not affected in any way.
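
The most general rule, "parsed before it is accessed, and parsed only once", can be sketched in Java as a lazily initialized value. The ParsedOnce class and its use of IntSupplier are assumptions made for this illustration; the text does not say how a decoder would implement the rule, and the supplier below merely stands in for whatever code actually reads the bits.

```java
import java.util.function.IntSupplier;

// Hypothetical holder for a bitstream-derived constant: the parser runs on the
// first access and never again, so later reads behave like a plain constant.
final class ParsedOnce {
    private final IntSupplier parser;
    private boolean parsed = false;
    private int value;

    ParsedOnce(IntSupplier parser) {
        this.parser = parser;
    }

    int get() {
        if (!parsed) {
            value = parser.getAsInt();   // parsing is triggered by the first access
            parsed = true;
        }
        return value;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The supplier stands in for "read some bits from the stream".
        ParsedOnce alpha = new ParsedOnce(() -> {
            calls[0]++;
            return 7;
        });
        System.out.println(alpha.get() + alpha.get());
        System.out.println("parser invoked " + calls[0] + " time(s)");
    }
}
```

The simpler rule mentioned above, parsing at instantiation, would correspond to calling the parser in the constructor instead of in get.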

The key characteristics of SDL within the MSDL environment can be seen by considering the following
simple example of the definition of a trivial object in C++/Java:

// C++/Java
class simple {
    int alpha;
    void decode() {
        alpha = in.getint(3);   // explicit parsing
        alpha += 15;            // decoding
    }
}

Here the bitstream (in) contains the value of the parameter alpha in 3 bits, which is subsequently
decoded by a simple offset adjustment. The same object in SDL would be defined as:

// MSDL
class simple {
    int(3) alpha;               // declarative parsing
    void decode() {
        alpha += 15;            // decoding
    }
}
The construct int(3) indicates that the parameter alpha will be parsed as 3 bits and then converted to an
int. The fundamental programming constructs remain unchanged, but with the added benefit of clearly
separating the parsing information. Moreover, all the necessary information (data types, parsing information,
and object methods) is conveniently defined in a single place.

With respect to flexibility, and since SDL addresses the definition of the fundamental bitstream syntax, it
can potentially be applied in both flexible and non-flexible terminals. Clearly, incorporating a programmable
syntax parser even of medium complexity has a non-trivial impact on the overall cost of the terminal. As
a result, due to the desire to keep non-flexible implementations at low complexity and cost, syntax
programmability is currently considered primarily for flexible terminals.

With respect to the communication of the syntax to the MPEG-4 terminal, a binary format will be used.
Taking the parallel of Java, for downloadable object definitions the syntax specification becomes an additional
part of the downloadable class information. In MPEG-4, however, this information can also be downloaded
independently from the other components. This allows redefinition of the syntax without modifying in any way
the implementation of the methods of the particular class (tools). As will become evident in the next section, this
is easily achievable as long as existing data types and interfaces of the particular class remain the same. The
binary format for SDL is currently being investigated and will not be described here.

Note that parsing MSDL code is no more complex than parsing regular C++ or Java code.
For simulation purposes, the MSDL code can be processed by a translator that generates equivalent
C++/Java code (an approach which is currently pursued). The translator can directly generate parsing code,
e.g., as part of the constructor, or in the future generate the binary parsing information that would be
transmitted to a terminal.
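
As an illustration of the translator approach, the following Java sketch shows what generated code for the declaration int(3) alpha might look like when the parsing is placed in the constructor. This is not the output of any actual MSDL translator: BitSource is a hypothetical MSB-first reader, the 3-bit field is assumed to be unsigned, and the class name SimpleGenerated is invented. The decode method is the hand-written offset adjustment from the example above, carried over unchanged.

```java
// Hypothetical MSB-first bit source; the text only names a method in.getint(n).
final class BitSource {
    private final byte[] data;
    private int pos = 0;

    BitSource(byte[] data) {
        this.data = data;
    }

    // Reads n bits (0 <= n <= 31), most significant bit first, as an unsigned value.
    int getint(int n) {
        int value = 0;
        for (int i = 0; i < n; i++) {
            int bit = (data[pos >> 3] >> (7 - (pos & 7))) & 1;
            value = (value << 1) | bit;
            pos++;
        }
        return value;
    }
}

// What a translator might emit for:  class simple { int(3) alpha; ... }
final class SimpleGenerated {
    int alpha;

    // Generated from the declaration: alpha is parsed once, at instantiation.
    SimpleGenerated(BitSource in) {
        alpha = in.getint(3);
    }

    // Hand-written decoding step, copied from the class definition.
    void decode() {
        alpha += 15;
    }

    public static void main(String[] args) {
        // 0xA0 = 1010 0000, so the first three bits are 101.
        SimpleGenerated s = new SimpleGenerated(new BitSource(new byte[] {(byte) 0xA0}));
        s.decode();
        System.out.println(s.alpha);
    }
}
```

Changing the declaration to a different bit length would only change the generated constructor; the decode method would stay as it is.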

The following describes an overview of the features provided by the current SDL specification. Due to
space limitations, several details are omitted; the reader is referred to the text of the specification for a more
thorough description [5]. More information can also be obtained from the MSDL Web site:
http://www-elec.enst.fr/msdl/.



5.3. Overview of SDL features

SDL directly extends the C-like syntax used in the MPEG-1 and MPEG-2 Technical Reports into
a well-defined framework that lends itself to object-oriented data representations and machine translation.

The bitstream syntax definition features of SDL are described in the form of formal grammar rules.
Elementary constructs are first described, moving to composite syntactic constructs, arithmetic and logical
expressions, and finally address syntactic flow control and functions. Syntactic flow control is needed to take
into account context-sensitive data. Several examples are used to clarify the structure, primarily based on the
MPEG-2 Video International Standard.

5.3.1. Elementary data types

SDL identifies the following elementary syntactic elements:

1. Constant-length direct representation bit fields or fixed length codes (e.g., temporal_reference). These include the encoded value as it is to be used by the decoder.

2. Variable length direct representation bit fields. These are codes for which the length is determined by the context of the bitstream (e.g., a syntactic field whose length is determined by the value of a previously parsed syntactic field).

3. Constant-length indirect representation bit fields (e.g., chroma_format or coded_block_pattern). These require an extra lookup into an appropriate table or variable, or some algorithmic processing to obtain the desired value (e.g., coded_block_pattern).

4. Variable-length indirect representation bit fields (e.g., Huffman codes for DCT coefficient run-lengths).

5.3.1.1. Constant-length direct representation bit fields. These can be simply represented as:
Rule 1

    [aligned] type[(length)] element_name [= value];
    // C++-style comments allowed

The type is any of the familiar C/C++ fundamental data types (signed/unsigned int, char, etc.), with the addition of 'bit' for raw data. 'length' indicates the length of the element in bits, as it is stored in the bitstream. This signals the bitstream parser to read the specified number of bits, interpret them according to the specified type, and place them in the memory area associated with the particular variable name.

The value attribute is only present when the value is fixed (e.g., start codes or object IDs), and it may also indicate a range of values (e.g., '0x01..0xAF'). The type and the optional length are always present, except if the data are non-parsable, i.e., they are not included in the bitstream. The attribute 'aligned' means that the data are aligned on a byte boundary. As an example, a start code would be represented as

    aligned bit(32) picture_start_code = 0x00000100;

An optional numeric modifier, e.g., aligned(32), can be used to signify alignment on other than a byte boundary. For example, an entity such as temporal reference would be represented as

    unsigned int(5) temporal_reference;

where 'unsigned int(5)' indicates that the element should be interpreted as a 5-bit unsigned integer (by default with the most significant bit first).
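As a rough illustration of what Rule 1 asks of a bitstream parser, the following Python sketch reads the two example fields given above. The BitReader class, its method names and the parse_picture_header function are hypothetical helpers introduced only for this illustration; they are not part of SDL or of any MPEG-4 software.

```python
class BitReader:
    """Hypothetical helper: reads bit fields MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0  # current read position, in bits

    def align(self, boundary: int = 8) -> None:
        """Skip ahead to the next multiple of `boundary` bits ('aligned')."""
        self.pos = -(-self.pos // boundary) * boundary

    def read(self, length: int) -> int:
        """Read `length` bits and interpret them as an unsigned integer."""
        chunk = self.bits[self.pos:self.pos + length]
        if len(chunk) < length:
            raise EOFError("not enough bits left in the bitstream")
        self.pos += length
        return int(chunk, 2) if length else 0


def parse_picture_header(data: bytes) -> dict:
    reader = BitReader(data)
    # aligned bit(32) picture_start_code = 0x00000100;
    reader.align()
    start_code = reader.read(32)
    if start_code != 0x00000100:
        raise ValueError("fixed value of picture_start_code not found")
    # unsigned int(5) temporal_reference;
    temporal_reference = reader.read(5)
    return {"picture_start_code": start_code,
            "temporal_reference": temporal_reference}


# 0xB0 is 1011 0000 in binary; its first five bits are 10110.
header = parse_picture_header(bytes([0x00, 0x00, 0x01, 0x00, 0xB0]))
print(header["temporal_reference"])  # prints 22
```

The fixed value in the declaration doubles as a consistency check here: the sketch rejects the input when the start code does not match.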

Note that constants are defined using the 'const' attribute:

    const int SOME_VALUE = 255;
    const bit(3) BIT_PATTERN = 1;
    // this is equivalent to the bitstring '001'

To designate binary values, the '0b' prefix is used, similar to the '0x' prefix for hexadecimal numbers, and a period ('.') can be optionally placed every four digits for readability. Hence 'bit(8) 0x0F' is equivalent to 0b0000.1111.
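The literal notation can be checked with a few lines of Python. This is only a sketch: the function name parse_sdl_bits is hypothetical, and the assumption that each hexadecimal digit contributes exactly four bits to the length is made here for illustration.

```python
def parse_sdl_bits(literal: str) -> tuple[int, int]:
    """Return (value, length_in_bits) for literals like '0b0000.1111' or '0x0F'."""
    text = literal.replace(".", "")  # periods are only visual separators
    if text.startswith("0b"):
        digits = text[2:]
        return int(digits, 2), len(digits)
    if text.startswith("0x"):
        digits = text[2:]
        return int(digits, 16), 4 * len(digits)
    raise ValueError(f"unsupported literal: {literal!r}")


# Both spellings describe the same 8-bit pattern.
print(parse_sdl_bits("0b0000.1111") == parse_sdl_bits("0x0F"))  # prints True
```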

5.3.1.2. Variable length direct representation bit fields. This case is covered by Rule 1, by allowing the 'length' field to be a variable included in the bitstream, or an expression involving such a variable. For example,

    unsigned int(3) precision;
    int(precision) DC;
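A minimal Python sketch of how a parser might handle this pair of declarations is given below. The BitReader helper and the function names are hypothetical, and the two's-complement interpretation of the signed 'int' field is an assumption made for the example; the text above does not state how signed values are coded.

```python
class BitReader:
    """Hypothetical helper: reads bit fields MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def read(self, length: int) -> int:
        chunk = self.bits[self.pos:self.pos + length]
        if len(chunk) < length:
            raise EOFError("not enough bits left in the bitstream")
        self.pos += length
        return int(chunk, 2) if length else 0


def to_signed(value: int, length: int) -> int:
    """Interpret an unsigned field as two's complement (assumed coding)."""
    if length and value >= 1 << (length - 1):
        value -= 1 << length
    return value


def parse_dc(data: bytes) -> tuple[int, int]:
    reader = BitReader(data)
    precision = reader.read(3)                        # unsigned int(3) precision;
    dc = to_signed(reader.read(precision), precision)  # int(precision) DC;
    return precision, dc


# 0x9A is 100 1101 0: precision = 4, then the 4-bit field 1101.
print(parse_dc(bytes([0x9A])))  # prints (4, -3)
```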

5.3.1.3. Constant-length indirect representation bit fields. Here, in addition to the actual element, one needs to define how it is mapped to obtain the actual values that the decoder will use. This can be accomplished by defining the map itself:
Rule 2

    map MapName (output_type) {
        index, {value_1, ... value_M}, ...
    }

The input type of such a table is always 'bit'. These tables are used to translate or map bits from the bitstream into a set of one or more values. The output_type entry is either a pre-defined type or a defined class. The map is initialized with pairs of keys and values. Keys are binary string constants while values are output_type constants. Values are specified as aggregates surrounded by curly braces, similar to C or C++ structures.

Here follows an example, noting that the precise definition of classes is given in the next section:

    class YUVblocks {
        unsigned int Yblocks;
        unsigned int Ublocks;
        unsigned int Vblocks;
    }

    // a table that relates the chroma format with the
    // number of blocks per signal component
    map blocks_per_component (YUVblocks) {
        0b00, {4, 1, 1},   // 4:2:0
        0b01, {4, 2, 2},   // 4:2:2
        0b10, {4, 4, 4}    // 4:4:4
    }

The next rule describes the use of such a map in the declaration of a variable.

Rule 3

    type (MapName) name;

The type here is the output type defined in the map MapName. Example:

    YUVblocks(blocks_per_component) chroma_format;

Using the above declaration, one can access a particular value of the map using the construct chroma_format.Ublocks.
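In a general-purpose language the map of Rule 2 behaves like a lookup table keyed by bit patterns. The sketch below mirrors the chroma format example; the bitstream is modelled as a string of '0' and '1' characters, and the function name read_chroma_format as well as the fixed 2-bit key width are assumptions made for this illustration.

```python
from typing import NamedTuple


class YUVblocks(NamedTuple):
    Yblocks: int
    Ublocks: int
    Vblocks: int


# Counterpart of 'map blocks_per_component (YUVblocks)'.
BLOCKS_PER_COMPONENT = {
    "00": YUVblocks(4, 1, 1),  # 4:2:0
    "01": YUVblocks(4, 2, 2),  # 4:2:2
    "10": YUVblocks(4, 4, 4),  # 4:4:4
}


def read_chroma_format(bits: str, pos: int = 0) -> tuple[YUVblocks, int]:
    """Read a 2-bit key at `pos` and return (mapped value, new position)."""
    key = bits[pos:pos + 2]
    if key not in BLOCKS_PER_COMPONENT:
        raise ValueError(f"no map entry for bit pattern {key!r}")
    return BLOCKS_PER_COMPONENT[key], pos + 2


chroma_format, next_pos = read_chroma_format("01")
print(chroma_format.Ublocks)  # prints 2
```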

5.3.1.4. Variable length indirect representation bit fields. For a variable length element utilizing a Huffman table, a similar declaration is used:

    int(table) ac_dct_coefficient;

The definition of the table is done in exactly the same way as described before, but here the key entries have variable lengths. For example:

    class val {
        unsigned int foo;
        int bar;
    }

    map sample_vlc_map (val) {
        0b0000.001, {0, 5},
        0b0000.0001, {1, -14}
    }

Very often, VLC tables are incomplete: due to the large number of possible entries, it is inefficient to keep using variable length codewords for all possible values. This necessitates the use of escape codes that signal the subsequent use of a fixed-length (or even variable length) representation. To allow for such exceptions, parsable type declarations are allowed for map values. This is illustrated in the following example:

    map sample_map_with_esc (vlc, val) {
        0b0000.001, {0, 5},
        0b0000.0001, {1, -14},
        0b0000.0000.1, {5, int(32)},
        0b0000.0000.0, {0, -20}
    }

As written above, when the codeword 0b0000.0000.1 is encountered in the bitstream, then the value '5' is assigned to the first element (val.foo), while the following 32 bits will be parsed and assigned as the value of the second element (val.bar). Note that the order is significant. Using this construct, the complete behavior of the VLC mapping is described in a concise manner in a single place.
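The escape mechanism can be sketched in Python as follows. The bitstream is modelled as a string of '0' and '1' characters; the names ESCAPE_INT32 and decode_val are hypothetical, and the two's-complement reading of the 32-bit escape value is an assumption made for the example.

```python
ESCAPE_INT32 = "int(32)"  # marks a map value that is parsed from the bitstream

# Counterpart of the map with escape code: codeword -> (foo, bar).
SAMPLE_MAP_WITH_ESC = {
    "0000001": (0, 5),
    "00000001": (1, -14),
    "000000001": (5, ESCAPE_INT32),
    "000000000": (0, -20),
}
MAX_CODE_LENGTH = max(len(code) for code in SAMPLE_MAP_WITH_ESC)


def decode_val(bits: str, pos: int = 0) -> tuple[tuple[int, int], int]:
    """Return ((foo, bar), new_pos) for the codeword starting at `pos`."""
    for length in range(1, MAX_CODE_LENGTH + 1):
        code = bits[pos:pos + length]
        if len(code) == length and code in SAMPLE_MAP_WITH_ESC:
            foo, bar = SAMPLE_MAP_WITH_ESC[code]
            pos += length
            if bar == ESCAPE_INT32:
                raw = bits[pos:pos + 32]
                if len(raw) < 32:
                    raise EOFError("escape value is truncated")
                value = int(raw, 2)
                # Assumed two's-complement coding of the signed value.
                bar = value - (1 << 32) if value >= (1 << 31) else value
                pos += 32
            return (foo, bar), pos
    raise ValueError("no codeword matches the bitstream")


escaped = "000000001" + format(1000, "032b")
print(decode_val(escaped))  # prints ((5, 1000), 41)
```

Because the codewords are prefix-free, trying increasing lengths finds at most one match, so the regular entries and the escape entry share a single decoding loop.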

5.3.2. Composite data types

5.3.2.1. Classes. Equipped with the above definitions for fundamental types, the definition of composite types or objects is now examined. A very useful feature is to be able to immediately identify the type of object dealt with: object identifiers are then a particularly attractive feature. In several cases, the desire for bit efficiency precludes their use (this is the case in MPEG-2 below the slice level). The definition of a composite object can then be expressed as:
Rule 4 

    [aligned] class object_name [extends parent_class] [:
        bit(length) [id_name] = object_id | id_range] {
        [element; ...]   // zero or more elements
    }

The different elements are definitions of elementary bitstream components as described in the previous section, or flow control that is discussed later on. The object_id is optional and if present is the key multiplexing entity for individual objects. The id_range is specified as start_id .. end_id, inclusive of both bounds, to express that the object can have a range of possible IDs.* The optional 'extends parent_class' indicates inheritance.

Example

    class slice: bit(32) slice_start_code = 0x00000101 .. 0x000001AF {
        ... // here vertical_size_extension is obtained, if present
        if (scalable_mode == DATA_PARTITIONING) {
            unsigned int(7) priority_breakpoint;
        }
    }

The order of declaration of bitstream components is important: it is the same order in which the elements appear in the bitstream. Objects can also be encapsulated within other objects. In this case, the element mentioned at the beginning of this section is an object itself.



*Note that high level objects become visible at the multiplex layer. At this time, it has not been decided if the object ID syntax described here will be used for the multiplex layer as well.

5.3.2.2. Parameter types. A parameter type defines a class with parameters. This is to address cases where the data structure of the class depends on variables of one or more other objects. The concept of parameter types is consistent with the key design principle in SDL, i.e., a declarative approach that allows conditional parsing in specifying syntax data structures. In fact, parameter types are a very intuitive feature and, once defined, can be seen to be widely needed in resolving cross-references between objects. The syntax of a class with parameters is:

Rule 5

    [aligned] class object_name [(parameter list)] [extends parent_class] [:
        bit(length) [id_name] = object_id | id_range] {
        [element; ...]   // zero or more elements
    }

The parameter list is a list of type name and variable name pairs separated by commas. A class that uses parameter types is dependent on the objects in its parameter list. When instantiating such a class into an object, the parameters have to be instantiated objects of their corresponding classes or types.

Example

    class A {
        // class body
        uint(4) format;
    }

    class B (A a, int i) {   // B uses parameter types
        uint(i) bar;
        ...
    }

    class C {
        int(2) i;
        A a;
        B(a, i) foo;   // instantiated parameters are required
    }

5.3.3. Arrays

Arrays are defined in a similar way as in C/C++, i.e., using square brackets. Their length, however, can depend on run-time parameters such as other bitstream values or expressions that involve such values. The array declaration is applicable to both elementary as well as composite objects.

Example

    unsigned int(3) length;
    unsigned int(5) elements;
    int(length) array[elements];

Here, the length of each element as well as the number of elements is obtained from the bitstream, and then all the elements are read into array. SDL also allows incremental declaration of arrays; for more details, we refer the reader to the MSDL Working Draft [5].
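The array example can be traced with the following Python sketch. The BitReader helper and the function names are hypothetical, and signed elements are assumed to be coded in two's complement, which the text does not specify.

```python
class BitReader:
    """Hypothetical helper: reads bit fields MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def read(self, length: int) -> int:
        chunk = self.bits[self.pos:self.pos + length]
        if len(chunk) < length:
            raise EOFError("not enough bits left in the bitstream")
        self.pos += length
        return int(chunk, 2) if length else 0


def to_signed(value: int, length: int) -> int:
    """Interpret an unsigned field as two's complement (assumed coding)."""
    if length and value >= 1 << (length - 1):
        value -= 1 << length
    return value


def parse_array(data: bytes) -> list[int]:
    reader = BitReader(data)
    length = reader.read(3)    # unsigned int(3) length;
    elements = reader.read(5)  # unsigned int(5) elements;
    # int(length) array[elements];
    return [to_signed(reader.read(length), length) for _ in range(elements)]


# Bits: 100 (length 4), 00011 (3 elements), then 0001, 1110, 0111.
print(parse_array(bytes([0x83, 0x1E, 0x70])))  # prints [1, -2, 7]
```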

5.3.4. Arithmetic and logical expressions

All standard arithmetic and logical operators of C/C++ are used as defined in these languages, including their precedence rules.

5.3.5. Non-parsable variables

In order to accommodate complex syntactic constructs, in which context information cannot be directly obtained from the bitstream but is the result of a non-trivial computation, non-parsable variables are allowed. These naturally follow the regular C++/Java scoping rules, and have the scope of the class in which they are defined. Note that non-parsable variables are naturally needed in flexible implementations, in which new object and method (tool) definitions are allowed.



5.3.6. Syntactic flow control

Syntactic flow control provides constructs that allow conditional parsing, depending on context, as well as repetitive parsing. The familiar C++/Java if-then-else construct is used for testing conditions. Note that since syntactic flow control occurs at the scope of class declarations, it can be easily distinguished by a software translator as referring to the syntax specification. Such structures are only allowed within method definitions in regular programming languages such as C++ and Java.
The following example illustrates the procedure.

    map some_vlc_table (unsigned int) {
        0b0, 32,
        0b10, 33,
        ...
    }

    class conditional_object {
        unsigned int(3) foo;
        bit(1) bar_flag;
        if (bar_flag) {
            unsigned int(8) bar;
        } else {
            vlc(some_vlc_table) bar;
        }
        unsigned int(32) more_foo;
    }

Here two different representations are allowed for bar, depending on the value of bar_flag (note that the VLC in this case must return entries of type unsigned int, so that there is no conflict with the previous definition of bar). There could equally well be another entity instead of the second version (the variable length one) of bar (another object, or another variable).
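The kind of parsing code a translator might emit for conditional_object can be sketched in Python as below. The BitReader helper, read_vlc and parse_conditional_object are hypothetical names, and only the two table entries listed in the example are included.

```python
class BitReader:
    """Hypothetical helper: reads bit fields MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def read(self, length: int) -> int:
        chunk = self.bits[self.pos:self.pos + length]
        if len(chunk) < length:
            raise EOFError("not enough bits left in the bitstream")
        self.pos += length
        return int(chunk, 2) if length else 0


SOME_VLC_TABLE = {"0": 32, "10": 33}  # only the entries shown in the example


def read_vlc(reader: BitReader, table: dict) -> int:
    """Extend the codeword bit by bit until it matches a table key."""
    longest = max(len(code) for code in table)
    code = ""
    while code not in table:
        if len(code) == longest:
            raise ValueError("no codeword matches the bitstream")
        code += str(reader.read(1))
    return table[code]


def parse_conditional_object(data: bytes) -> dict:
    reader = BitReader(data)
    obj = {"foo": reader.read(3), "bar_flag": reader.read(1)}
    if obj["bar_flag"]:
        obj["bar"] = reader.read(8)                    # unsigned int(8) bar;
    else:
        obj["bar"] = read_vlc(reader, SOME_VLC_TABLE)  # vlc(some_vlc_table) bar;
    obj["more_foo"] = reader.read(32)
    return obj


# foo = 5, bar_flag = 1, bar = 200 (fixed length), more_foo = 1
fixed = parse_conditional_object(bytes([0xBC, 0x80, 0x00, 0x00, 0x00, 0x10]))
# foo = 0, bar_flag = 0, bar coded as the VLC codeword '10', more_foo = 0
coded = parse_conditional_object(bytes([0x08, 0x00, 0x00, 0x00, 0x00]))
print(fixed["bar"], coded["bar"])  # prints 200 33
```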

In the same category of context-sensitive objects are the so-called repetitive objects. These simply imply the repetitive use of the same syntax to parse the bitstream, until some condition is met (it is the conditional repetition that implies context, but fixed repetitions are obviously treated the same way). The familiar structures of 'for', 'while', and 'do' loops can be used for this purpose.

To facilitate bitstream-level conditional tests, the following notation is used.

Rule 6

The construct '[bitstring]' is a test condition that is true (non-zero) if the next bits present in the input bitstream are equal to bitstring. The construct '[bitstring*]' performs the same operation, but if the string is found, the bits are removed from the bitstream.
The SDL specification defines several other minor constructs, which are omitted for brevity.

5.4. Future work

In order to assert the power and completeness of SDL, descriptions of several different audiovisual coding standards are pursued. Furthermore, work is underway to convert the entire MPEG-4 set of specifications to utilize SDL for bitstream representation, thus helping to further refine the language. This process will ensure that all the necessary components are available for the codec designer's arsenal. In addition, software translators are being developed that produce C++/Java code from SDL code. These tools are necessary in order to experiment with the overall SDL architecture, ensure integration with the overall MSDL environment and APIs, and solidify the syntax and grammar rules. Finally, reconfiguration of syntactic decoding based on SDL is also being developed in order to assess its potential utility within the context of MPEG-4.

6. Synchronization and multiplexing - managing time

The previous parts of this paper developed an object-oriented model of audiovisual information and its processing. A coded representation of a scene formed from a set of AV objects that is generated according to this architecture finally has to be prepared for transmission or storage in a manner suitable for synchronized, in most cases real-time, delivery and decoding. This representation of the scene is called a session, emphasizing its potentially interactive character and its finite duration. This part states the requirements on synchronization and multiplexing, followed by proposals for a system decoder model, an elementary stream interface, and a two-layer multiplex approach.

6.1. Requirements on synchronization and multiplexing

A wide range of requirements has been identified by the MPEG-4 project. Many of them are related to the achievable Quality of Service (QoS). They include

- support for a large and time varying number of concurrent data streams,
- support for variable bit-rate data streams,
- synchronization of data streams,
- prioritization of data streams,
- low and deterministic end to end delay,
- bounded amount of multiplex jitter,
- reliable, low bandwidth control channels,
- low residual error data channels, and
- reliable transmission error detection.

Different AV objects may have different requirements while belonging to the same application, e.g., if real-time communication and data base access are combined. The data for these concurrent sessions may be stored or transmitted jointly. Even when considering only a single AV object, it is desirable to have better error protection for crucial header information than for the remainder of the data.

MPEG-4 can be used in rather reliable communication channels like LANs as well as in error prone environments where reliable re-synchronization and other error protection features are very important. This suggests having such features configurable according to the application or even according to the needs of individual data streams. In general, it should be possible to exploit the relevant capabilities of the underlying network.

The amount of overhead must be adaptable to the bit-rate of the application. A low bit-rate mobile multimedia communication application may not tolerate 1 kbps for multiplex and stream-related information, while this may be no problem for a similar, LAN-based application.
Furthermore, it is desirable for an application to have a similar interface to a broadcast and a point to point transmission channel, obviously without a return channel in the former case. Bitstream editing of some AV object data must be possible to some extent without parsing into an elementary stream. This requires easily accessible descriptive information about the content of the multiplexed bitstream. Additional requirements include conditional access support, feasibility of re-multiplexing, and easy interworking in heterogeneous networks.



6.2. System decoder model

The purpose of the System Decoder Model (SDM) is to provide a simplified description of the location and behavior of buffers in the system and the definition of timing information. With this model an encoder is able to monitor the decoder buffer state and knows how synchronization of data streams is achieved. The model is an adaptation of the System Target Decoder described in MPEG-2 Systems [7]. An additional model of the execution time for the various functions of an MPEG-4 decoder system would be desirable; however, it is out of scope of this paper.

6.2.1. System buffer model

Bits corresponding to coded entities that are to be presented simultaneously do not necessarily have to be in close neighborhood within a multiplexed data stream. A system buffer model is used to quantify the allowed range of this multiplex jitter as well as the induced delay.

The SDM includes the demultiplexer, adaptation layer (AL) decoders, buffers for encoded data for each elementary stream (EB), the AV object decoders, buffers for decoded data for each AV object (PB) and the compositor, as outlined in Fig. 11. Note that in general more than one elementary stream may be connected to a single AV object decoder, e.g., for scaleable coding, as indicated with AVO-Dec2, while there is always exactly one PB associated to each AV object decoder.

Processing times in all SDM elements are assumed to be zero. EB buffers are filled at the true rate of the incoming data for this stream. All incoming data are partitioned in access units (AU). AUs must have implicit or explicit decoding and presentation times. At its decoding time an AU is instantaneously removed from EB, decoded and put as a presentation unit (PU) in the associated presentation buffer. The PU may be accessed multiple times by the compositor as long as it remains in PB. The PU is removed from PB at the presentation time of the temporally subsequent PU or, if this does not exist, at the end of lifetime of the AV object.

With these model assumptions the encoder may freely use the space in the buffers. For example, it may transfer data for several access units of a non-real time stream to the decoder and pre-store them in the EB some time before they have to be decoded. Then the full multiplex channel bandwidth may be used to transfer data of a real time stream just in time afterwards.
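To illustrate how an encoder could use the model assumptions to track the decoder buffer state, the following Python sketch computes the peak occupancy of one EB. It is a simplification made for this example only: each access unit is treated as entering the buffer in one piece at its arrival time instead of at the true channel rate, and the AccessUnit fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AccessUnit:
    arrival: float  # time at which the whole AU has entered EB (seconds)
    decode: float   # decoding time: the AU leaves EB instantaneously
    size: int       # size in bits


def peak_eb_occupancy(units) -> int:
    """Return the largest number of bits held in EB at any time."""
    events = []
    for au in units:
        if au.decode < au.arrival:
            raise ValueError("access unit would be decoded before it arrives")
        events.append((au.arrival, 1, au.size))  # bits enter the buffer
        events.append((au.decode, 0, -au.size))  # bits are removed
    # At equal times removals (0) are processed before arrivals (1), so an
    # AU decoded at time t frees its space for data arriving at time t.
    events.sort()
    occupancy = peak = 0
    for _time, _order, delta in events:
        occupancy += delta
        peak = max(peak, occupancy)
    return peak


units = [
    AccessUnit(arrival=0.0, decode=1.0, size=1000),  # pre-stored early
    AccessUnit(arrival=0.2, decode=1.2, size=1000),  # pre-stored early
    AccessUnit(arrival=1.0, decode=1.5, size=500),
]
print(peak_eb_occupancy(units))  # prints 2000
```

An encoder could compare such a peak against the buffer size it has requested for the stream before deciding how far ahead data may be pre-stored.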




Fig. 11. System decoder model.

The PB may be used, e.g., as a reordering buffer for early decoded P-frames which are needed by the video decoder for decoding of intermediate B-frames before the presentation time of the P-frame arrives. PB can also be used for any other decoded AV data that needs to be available for some longer period of time (static background images, synthetic model descriptions, etc.).

The encoder has to signal the required buffer resources to the decoder before starting the transmission. This can either be done explicitly by requesting buffer sizes for individual elementary streams or AV objects or for the whole session. It can also be done implicitly by specification of an MPEG-4 profile and level.

6.2.2. System timing model

AV objects being transmitted and presented in real time have a timing model in which the end-to-end delay from the signal input to an encoder to the signal output from a decoder is a constant. This delay is the sum of all processing and buffering delays. Encoded data contain implicit or explicit timing information to convey the interval between successive access units. Implicit timing information could be, e.g., a constant coded frame rate.

The AV objects in one session may be encoded by different encoder systems which, in general, run at slightly different clock speeds. So, each AV object has its own Object Time Base (OTB). The speed of the System Time Base (STB) of an MPEG-4 decoder system will not be synchronized a priori with any of these OTB. Therefore, Object Clock Reference (OCR) time stamps are introduced in the bitstream to convey the speed of the encoder's time base(s) to the decoder. They must be transmitted frequently enough so that the decoder can track the OTB.

Decoding Time Stamps (DTS) and Presentation Time Stamps (PTS) are used to indicate decoding and presentation time of a given access unit, measured in units of the OTB. To ensure synchronized decoding and presentation of these AV objects, all explicit or implicit timing information has to be mapped to the STB, yielding a system presentation time t_SPT from the known object presentation time t_OPT by the following formula:

$$t_{SPT} = \frac{\Delta t_{STB}}{\Delta t_{OTB}} \, t_{OPT} - \frac{\Delta t_{STB}}{\Delta t_{OTB}} \, t_{OTB\text{-}START} + t_{STB\text{-}START}$$

with

t_SPT        decoder's system presentation time measured in units of t_STB,
t_STB        decoder's System Time Base,
t_OPT        AV object presentation time measured in units of t_OTB,
t_OTB        AV object encoder's time base,
t_STB-START  value of decoder's STB when the first OCR time stamp of the AV object is encountered,
t_OTB-START  value of the first OCR time stamp of the AV object.

Similarly, a system decoding time t_SDT can be computed from the known object decoding time t_ODT.
Note that a system may be operated without transmission of timing information. In that case the system decoder model cannot be applied, i.e., the encoder cannot know how buffers are used and when access units are processed by the decoder. Tight synchronization is therefore not possible.
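The mapping from an object time base to the system time base can be sketched numerically. The Python below assumes a linear relation: object time stamps are taken relative to the first OCR value, scaled by the ratio of the two clock rates, and offset by the STB value observed when that first OCR arrived. The function names, the (OTB, STB) sample pairs and the way the rate ratio is estimated from two OCR observations are assumptions made for this illustration.

```python
def estimate_rate_ratio(first: tuple[float, float],
                        later: tuple[float, float]) -> float:
    """Estimate STB ticks per OTB tick from two (otb_value, stb_value) pairs.

    Each pair holds an OCR time stamp and the decoder's STB reading at the
    moment that OCR was received.
    """
    (otb_a, stb_a), (otb_b, stb_b) = first, later
    if otb_b == otb_a:
        raise ValueError("OCR samples must differ to estimate the clock rate")
    return (stb_b - stb_a) / (otb_b - otb_a)


def to_system_time(t_object: float, otb_start: float, stb_start: float,
                   rate_ratio: float = 1.0) -> float:
    """Map an object time stamp (PTS or DTS) to the decoder's STB."""
    return rate_ratio * (t_object - otb_start) + stb_start


# The first OCR (value 1000) was seen when the STB read 500; a later OCR
# (value 91000) was seen at STB 90500, so both clocks run at the same speed.
ratio = estimate_rate_ratio((1000, 500), (91000, 90500))
print(to_system_time(9000, otb_start=1000, stb_start=500, rate_ratio=ratio))
# prints 8500.0
```

The same function serves for decoding time stamps, since both kinds of time stamp are expressed in units of the object time base.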



6.3. Elementary stream interface

It is desirable to define an interface for MPEG-4 data to the underlying transport layer that is independent of this layer itself and, more precisely, hides the specifics of any transport layer from the MPEG-4 system implementation.
This interface is called the Elementary Stream (ES) Interface. All data arriving at or leaving an MPEG-4 terminal are transferred by method calls to InElemStream or OutElemStream objects, respectively. Additionally, the API for these classes has to be designed so as to provide all the necessary configuration parameters for OutElemStreams and to return all required status information in case of InElemStreams.

The precise parameters of this interface are still under investigation. Two major issues need to be resolved: how are quality of service requirements communicated and what is the minimum interface to be standardized.

A certain quality of service may be provided either by explicitly inserting adapters that perform a specific task, e.g., forward error correction, or by a more generic interface that requests the QoS in terms of, e.g., the permissible residual error rate, jitter, delay, stream priority, etc. This leaves the task to the API implementation to ensure compliance to these QoS requirements.

The OutElemStream interface may include the following methods:

- setBufferSize - sets the size of the buffer associated with this stream instance,
- put - output a number of bits or bytes to the stream,
- accessUnitStart - label this bitstream position as start of an access unit,
- decodingTimestamp - assign a decoding time to the current access unit,
- presentationTimeStamp - assign a presentation time to the current access unit,
- openChild - establish a new OutElemStream as child of the current stream,
- child - get a handle to a (already open) child of the current stream.

Calls to the put method transparently write user data while calls to the accessUnitStart and timestamp methods lead to the generation of an access unit header that is inserted in the bitstream by the multiplexer.

The InElemStream class is conceptually simpler since its main function is to allow retrieval of data. Its methods include:

- get - returns a number of bits or bytes from this elementary stream instance,
- accessUnitIndex - returns the index of the current access unit,
- decodingTimestamp - returns the decoding time of the current access unit,
- presentationTimeStamp - returns the presentation time of the current access unit,
- child - returns a handle to a child of the current stream (null if child does not exist).

Further API methods may include alignment of the read pointer to a datum boundary, and methods to return the current stream buffer fullness and the state of the stream.
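Since the precise parameters of the interface are still open, the following Python sketch is only one possible reading of the two method lists. It keeps the method names from the text, but every signature, the byte-oriented put/get, the in-memory storage of access units and the loopback construction of an InElemStream from an OutElemStream are assumptions made for illustration; a real terminal would hand the data to a multiplexer instead.

```python
class OutElemStream:
    """In-memory stand-in for an outgoing elementary stream."""

    def __init__(self, name: str = "root"):
        self.name = name
        self.buffer_size = None
        self.units = []      # one dict per access unit
        self.children = {}

    def setBufferSize(self, size: int) -> None:
        self.buffer_size = size

    def accessUnitStart(self) -> None:
        self.units.append({"dts": None, "pts": None, "data": bytearray()})

    def decodingTimestamp(self, time) -> None:
        self.units[-1]["dts"] = time

    def presentationTimeStamp(self, time) -> None:
        self.units[-1]["pts"] = time

    def put(self, data: bytes) -> None:
        if not self.units:   # data before any explicit access unit start
            self.accessUnitStart()
        self.units[-1]["data"] += data

    def openChild(self, name: str) -> "OutElemStream":
        self.children[name] = OutElemStream(f"{self.name}/{name}")
        return self.children[name]

    def child(self, name: str) -> "OutElemStream":
        return self.children[name]


class InElemStream:
    """Reads back what an OutElemStream has stored (loopback only)."""

    def __init__(self, source: OutElemStream):
        self._units = source.units
        self._children = {n: InElemStream(c) for n, c in source.children.items()}
        self._index = 0
        self._offset = 0

    def get(self, count: int) -> bytes:
        # Move on to the next access unit once the current one is exhausted.
        while (self._index < len(self._units) - 1
               and self._offset >= len(self._units[self._index]["data"])):
            self._index += 1
            self._offset = 0
        data = self._units[self._index]["data"] if self._units else b""
        chunk = bytes(data[self._offset:self._offset + count])
        self._offset += len(chunk)
        return chunk

    def accessUnitIndex(self) -> int:
        return self._index

    def decodingTimestamp(self):
        return self._units[self._index]["dts"]

    def presentationTimeStamp(self):
        return self._units[self._index]["pts"]

    def child(self, name: str):
        return self._children.get(name)  # None if the child does not exist


out = OutElemStream()
out.setBufferSize(4096)
out.accessUnitStart()
out.decodingTimestamp(100)
out.presentationTimeStamp(140)
out.put(b"abc")
out.accessUnitStart()
out.decodingTimestamp(200)
out.put(b"de")
out.openChild("enhancement")

inp = InElemStream(out)
print(inp.get(2), inp.accessUnitIndex(), inp.decodingTimestamp())
```

In this sketch get never reads across an access unit boundary, so the time stamps returned always belong to the data just delivered.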

Elementary streams have a hierarchical relationship that is determined by the MPEG-4 content designer. The ES interface ensures that streams can be accessed by their unique hierarchical name, hiding the name spaces for logical transport channels from the MPEG-4 system. The correspondence between both name spaces must of course be conveyed from multiplexer to demultiplexer. The number and properties of the elementary streams depend on the desired number of concurrent data channels and transport characteristics. Data for one AV object may be distributed on more than one ES with different QoS, e.g., if this AV object is compressed with an algorithm that allows for scalability.

From the systems perspective, elementary streams are information entities that are structured into access units, consisting of some visible attributes, like their decoding and presentation time, followed by a number of consecutive data bits. All coded information that is intended to be manipulated or edited on a system level ('bitstream editing'), without need to parse or decode this information, has to be delivered in elementary streams. Of course, it is at the discretion of the elementary stream user to apply a proprietary multiplexing of some data streams if no system visibility of these streams is required. Compression efficiency may be a reason for this choice; however, this is not in the spirit of MPEG-4, which aims to provide easy AV object accessibility. The low overhead multiplex described next tries to support this spirit.

6.4. A two-layer multiplex approach

The list of requirements shows that, in terms of the OSI layer model, the multiplexing and synchronization part of MPEG-4 is centered around the transport layer (multiplexing, QoS choice) with additional session layer (synchronization) and network layer (retransmission) functionalities.

Some of these functionalities are also provided by existing transport protocol stacks for LANs, ATM, ISDN or PSTN, digital video (DVB) or digital audio (DAB) broadcast, to name a few. Some protocols take into account real time issues like synchronization (H.225 [13] for TCP/IP LANs), bandwidth resource reservation (ATM) or error resilience (H.223 Annex A [12]). All provide some kind of temporal interleaving of data streams, albeit with vastly different sizes for the data cells or packets to be multiplexed.

A two-layer multiplex approach, as depicted in the gray shaded part of Fig. 12, has been designed that separates the functionalities in order to facilitate the interface to all these transport environments. The goal is to exploit their characteristics while adding functionalities that these environments lack, and always preserving the homogeneous interface towards the MPEG-4 system as described in the previous section.

Data streams with similar quality of service requirements are first multiplexed on a content multiplex layer (Section 6.4.1) that efficiently interleaves data from a variable number of variable bit-rate streams, frames access units and contains synchronization information. A service multiplex layer (Section 6.4.2) supplies channels with a variety of quality of service and provides reliable framing of its content (the content multiplexed data) and error detection.

Appropriate protocol layers of specific networks, as shown on the lower right part of Fig. 12, may substitute this service multiplex. The protection sublayer (PL) can be seen as an example implementation of the desired quality of service functionalities that may be implemented if the underlying network protocol does not provide them.

6.4.1. Content multiplex layer

The purpose of the content multiplex layer is an efficient temporal interleaving of a varying number of variable bit-rate elementary streams with a low multiplex delay. Each elementary stream is transported in a separate logical channel.

The content multiplex packet is of variable length and consists of an index, I, the length of the payload, L, and the packet payload, as indicated in Fig. 13. In the direct addressing mode (I ≥ N) the logical channel (LC) to which the complete packet payload belongs can be computed directly from the index I, as shown in Fig. 13(a). In the multiplex table mode (I < N), I is a pointer to a multiplex table entry T(I) that defines how a payload is shared between m logical channels (Fig. 13(b)). Currently N = 16 values are reserved for the multiplex table mode.

Fig. 14. Application of multiplex table mode.



A multiplex table entry T(I), which describes how a packet is shared, has to be defined with a suitable configuration protocol before it is used. This mode, which is adapted from ITU-T multiplex recommendations H.223 [11] and H.223A [12], allows the multiplex overhead to be reduced further under some conditions. This is illustrated in Fig. 14, which assumes an application with several concurrent streams, e.g., with animation parameters for synthetic objects. Each set of animation parameters may only be a few bytes at a time, but has to be updated regularly, e.g., every n milliseconds. Data from other streams may be interspersed. In multiplex table mode a table entry T(I) may be defined to multiplex parameters for all objects in one content multiplex packet, as indicated in Fig. 14(b) for logical channels A, B and C, saving some packet headers compared to Fig. 14(a). Data in logical channels X and Y are always multiplexed in direct addressing mode.

Savings in multiplex overhead, and possibly delay, occur if data rates are sufficiently predictable. If the multiplex table entries need to be redefined often, the advantage may disappear.
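The two addressing modes can be sketched as a small demultiplexing routine in Python. The text does not give field widths or the rule that maps an index to a logical channel, so the following are assumptions made purely for illustration: the index I and the length L occupy one byte each, the channel number in direct mode is I - N, and a table entry is a list of (channel, byte count) pairs.

```python
N_TABLE_ENTRIES = 16  # index values below N select the multiplex table mode


def demux_packet(packet: bytes,
                 table: dict[int, list[tuple[int, int]]]) -> dict[int, bytes]:
    """Split one content multiplex packet into per-channel payloads."""
    index, length = packet[0], packet[1]
    payload = packet[2:2 + length]
    if len(payload) != length:
        raise ValueError("packet is shorter than its length field claims")

    if index >= N_TABLE_ENTRIES:
        # Direct addressing: the whole payload belongs to one channel.
        return {index - N_TABLE_ENTRIES: payload}

    # Multiplex table mode: entry T(I) says how the payload is shared.
    shares = {}
    offset = 0
    for channel, count in table[index]:
        shares[channel] = shares.get(channel, b"") + payload[offset:offset + count]
        offset += count
    return shares


# T(0): two bytes each for logical channels 0, 1 and 2.
table = {0: [(0, 2), (1, 2), (2, 2)]}
shared = demux_packet(bytes([0, 6]) + b"aabbcc", table)
direct = demux_packet(bytes([19, 3]) + b"xyz", table)
print(shared, direct)
```

Under the assumed two-byte header, carrying the three short payloads in one table-mode packet spends two header bytes where three direct-mode packets would spend six.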

Each portion of the content multiplex packet payload belonging to a distinct logical channel (A1, B1, etc.) may have a further adaptation sublayer header. Functionalities to frame access units, to carry the object clock reference timestamps, and sequence numbering to detect lost adaptation layer packets can be configured here.

6.4.2. Service multiplex layer

The currently specified service multiplex layer is in many respects a generalization of features found in ITU-T H.223A [12]. This multiplexer has been designed for error prone channels and therefore features robust packet synchronization and constant packet length. On the protection sublayer (PL), framing, error detection and forward error correction tools using convolutional coding are included. Furthermore, automatic resend requests and convolutional interleaving are available. Even more than in the original H.223A, all these functionalities are configurable.

Recent discussions show that it has to be studied for which transmission or storage media this service
multiplex layer can be implemented as is. Knowing that the number of multiplexing layers should be
kept small, it has to be taken into account that many networks (see introductory remarks in Section 6.4)
already provide some of the functionalities suggested here. Furthermore, each network has its own error
characteristics and therefore needs specific tools for error protection. Therefore it can be assumed that this
layer will evolve to a set of interface specifications for all those networks that are of relevance to
MPEG-4.



6.5. Content management 

Content management does not appear to be a multiplexer task; however, some information about the
content of the multiplexed elementary streams is obviously necessary to allow object-oriented functionalities,
like bitstream editing, to be performed on the MPEG-4 system level.

In MPEG-2 this task is addressed from two sides. Program Specific Information (PSI) is defined to convey
a minimum of information to be able to assign elementary streams to their respective recipients. The unique
assignment between elementary streams and AV objects is implemented in MPEG-4 by the convention how
to access elementary streams via the ES interface in conjunction with a transmission of the necessary
correspondence information, which is much like MPEG-2 PSI, except that no semantic information about
the content of the elementary streams ("descriptors") is conveyed here. These descriptors in MPEG-2 are seen
as a potential source of ambiguity, as they mostly duplicate information that is already present in the
elementary streams themselves.

In MPEG-2, as a second side, a more exhaustive content management is realized via an additional
protocol, DSM-CC [10], that allows for, e.g., the selection of programs in a Video on Demand application.
While this functionality is only useful for part of the digital TV broadcast applications that make use of
MPEG-2, it is much more central for MPEG-4, which is supposed to have a focus on interactivity. Significant
activity is still needed to establish a framework, probably based on DSM-CC, to describe the content of
multiplexed MPEG-4 streams as well as to embed protocols to configure MPEG-4 communication sessions
and to implement user interactivity.

6.6. Future

The proposed synchronization and multiplexing framework is based on a homogeneous elementary
stream interface towards the MPEG-4 system, combined with a system decoder model, comparable to
MPEG-2/Systems but modified according to the MPEG-4 requirements. Underneath the elementary stream
interface a two-layer multiplex approach is implemented, separating a flexible content multiplex from
a service multiplex layer.

The work on this part of MPEG-4 Systems is still very new, so that this outline, even more than the other
parts, can only be seen as a snapshot of on-going work. Future work has to address implementation and
testing of the layered multiplex specification and a refinement of its functionalities. Adaptation to relevant
networks has to be defined. A major task, not only in the multiplex area, will be the definition of a description
and signaling framework that could also be used for multiplexer and session configuration. In this context
the homogeneous integration of broadcast and point-to-point applications will also be further pursued.

7. Conclusion

The MPEG-4 Systems and Description Languages is based on contemporary object-based
technology. Unlike previous digital or analog video systems, the MPEG-4 data representation is based on
natural audiovisual objects, not the (unnatural) audio and video frames used for presentation. So MPEG-4
Systems defines AV Objects that contain data and the methods to render and composite the objects. AV
Objects are hierarchical in general. The top object in a given audiovisual hierarchy is the scene object. The
generic MPEG-4 scene is a 3D space of arbitrary size changing dynamically over time. Presentation of this
scene consists of defining the viewport into the scene, rendering each object and compositing the objects onto
this viewport. MPEG-4 Systems provides the structures and interfaces to support this model.

In addition, efficient coding of the AV objects is required, so process objects are defined to perform this
function. The coded objects are converted to a bitstream format that can be configured with the Syntactic
Description Language, and finally, the objects and associated control data are multiplexed together for
storage or transmission.

As a result, MPEG-4 Systems provides an environment with considerable flexibility. This makes it
adaptable to many applications, and allows modification of individual components such that they may be
tailored to specific operating conditions. The environment is also highly interactive. Separation of the
presentation from the coding structure allows interactive selection of the presentation viewport. The natural
object structure allows interactive manipulation of the audiovisual objects.

MPEG-4 Systems attempts to provide a solution where the minimum is specified. It defines just the
underlying data structures and the APIs for each component of the complete solution. In this way, maximum
freedom is provided to the algorithm, architecture and system designers.

Acknowledgements 

The authors would like to express their deep appreciation to the MPEG-4 Systems sub-group. The major
part of the presented material has been discussed and constructed from the various contributions made
within the context of this group, including many fruitful discussions with MPEG-4 experts. The authors hope
that this paper faithfully represents the work and the spirit of this group, founded on cooperation and the
open exchange of ideas, and that it will serve as a good vehicle for promoting the understanding and further
development of MSDL.

References 

[1] ISO/IEC JTC1/SC29/WG11 NU»5, MPEG-4 Requirements Version 1.1, Maceió, Brazil, November 1996.
[2] ISO/IEC JTC1/SC29/WG11 M .^TS, MPEG-4 Audio Verification Model 1.0, Chicago, USA, September-October 1996.
[3] ISO/IEC JTC1/SC29/WG11 K|4», MPEG-4 Video Verification Model 5.0, Maceió, Brazil, November 1996.
[4] ISO/IEC JTC1/SC29/WG11 NUM, MPEG-4 SNHC Verification Model 5A, Maceió, Brazil, November 1996.
[5] ISO/IEC JTC1/SC29/WG11 NI4SX, MPEG-4 Systems Working Draft Version 2.0, Maceió, Brazil, November 1996.
[6] ISO/IEC JTC1/SC29/WG11 N14IU, MPEG-4 Systems Verification Model Version 2.0, Maceió, Brazil, November 1996.
[7] ISO International Standard 13818-1, MPEG-2 Systems, 1994.
[8] ISO International Standard 8824, Information Processing Systems - Open Systems Interconnection - Specification of Abstract
Syntax Notation One (ASN.1), 1990.






O. Avaro et al. / Signal Processing: Image Communication 9 (1997) 385-431



[9] ISO International Standard 8825, Information Processing Systems - Open Systems Interconnection - Specification of Basic
Encoding Rules for Abstract Syntax Notation One (ASN.1), 1990.
[10] ISO International Standard 13818-6, MPEG-2 Extension for Digital Storage Media Command and Control (DSM-CC), ITO.
[11] ITU-T Recommendation H.223, Multiplexing Protocol for Low Bitrate Multimedia Communication.
[12] ITU-T Draft Recommendation H.223 Annex A, Multiplexing Protocol for Low Bitrate Mobile Multimedia Communication.
[13] ITU-T Draft Recommendation H.225, Media Stream Packetization and Synchronization on Non-Guaranteed Quality of Service
LANs, 1996.
[14] Rumbaugh et al., Object-Oriented Modeling and Design, Prentice-Hall, Englewood Cliffs, NJ, 1991.



Olivier Avaro was born in Provence, France on 27 July 1968. He received his Dipl. Ing.
degree in telecommunication engineering in 1992 from the Higher School of
Telecommunication of Brittany. After joining France Telecom-CNET department on image
communication in 1991 he worked on image representation techniques and analysis. His research
areas cover video compression algorithms, error resiliency of video compression
algorithms, invariant representation of images and shapes and model based representation for
interpersonal communications. He has been involved early in the ISO/MPEG-4 project,
in particular through the European platform MAVT and its successor MoMuSys. He is
currently chairman of the MPEG-4 Systems subgroup.



Philip A. Chou was born in Stamford, CT on 17 April 1958. He received the B.S.E. degree
from Princeton University in 1980 and the M.S. degree from the University of California,
Berkeley, in 1983, both in electrical engineering and computer science, and the Ph.D.
degree in electrical engineering from Stanford University in 1988. From 1977 through the
present, he worked for IBM, AT&T Bell Laboratories, Princeton Plasma Physics Lab,
Telesensory Systems, Speech Plus, Hughes, and Xerox, where he was involved variously in
office automation, motion estimation in television, optical character recognition, LPC
speech compression and synthesis, text-to-speech synthesis by rule, compression of
digitized terrain, speech and document recognition, and image and video compression. His
research interests are pattern recognition, data compression and speech, image, and video processing. He is
the recipient, along with Tom Lookabaugh and Robert M. Gray, of the 1993 IEEE Signal Processing Society
paper award. He has served as a guest editor of the IEEE Transactions on Image Processing, and was
a consulting associate professor for Stanford University in 1995. Currently, he is with the Xerox Palo Alto
Research Center in Palo Alto, CA. Dr. Chou is a member of Phi Beta Kappa, Tau Beta Pi, Sigma Xi, and the
IEEE Computer, Information Theory, and Signal Processing societies.



Biographies 











Alexandros Eleftheriadis was born in Athens, Greece, in 1967. He received the Diploma in
Electrical Engineering and Computer Science from the National Technical University of
Athens, Greece, in 1990, and the M.S., M.Phil. and Ph.D. degrees in Electrical Engineering
from Columbia University, New York, in 1992, 1994 and 1995 respectively. Since 1995 he
has been an Assistant Professor in the Department of Electrical Engineering at Columbia
University, where he is leading a research team working in the areas of video signal
processing and compression, video communication systems (including video-on-demand
and Internet video), distributed multimedia systems, and the fundamentals of
compression. During the summers of 1993 and 1994, he was with AT&T Bell Laboratories,
Murray Hill, NJ, developing low bit-rate model-assisted video coding techniques for videoconferencing
applications. From 1990 until 1995, he was a Graduate Research Assistant in the Department of Electrical
Engineering at Columbia University. Prof. Eleftheriadis has served as a guest editor, committee member, and
organizer for several international journals and conferences. He is a member of the ANSI X3L3.1 Committee,
and is participating in the ISO/IEC JTC1/SC29/WG11 (MPEG) standardization activity as well as DAVIC.
Prof. Eleftheriadis is a member of the IEEE, the ACM, and the Technical Chamber of Greece.

Carsten Herpel was born in Cologne, Germany on 28 September 1962. He received his
Dipl. Ing. degree in electrical engineering in 1988 from the University of Aachen. After
joining THOMSON's research facility in Hannover, Germany, he worked on video
compression algorithms, co-developing an algorithm proposal both for ISO/IEC's
MPEG-1 and MPEG-2 standardization projects. This work was embedded in the
European COMIS and VADIS projects as well as in the German HDTVT project, where his
MPEG-2 work focussed on development and comparison of hierarchical coding
approaches for terrestrial HDTV broadcast systems. He currently works on Systems issues
in the ISO/MPEG-4 project.

Cliff Reader has over 20 years experience in digital video coding, image processing, and
real-time digital video systems design. His research explored the use of transform coding
for video in the early '70s. He has developed products for various imaging markets, most
recently concentrating in the emerging field of Consumer Digital Video. Since 1990 he has
been an active member of the ISO/IEC MPEG community, including being Head of the
US Delegation for two years, and leading the MPEG-4 activity from inception. He is also
the technical expert on MPEG intellectual property, assisting in the establishment of the
MPEG Patent Pool. Dr. Reader has been with Samsung Semiconductor for the past three
years as Associate Director for Strategic Marketing, with responsibility for the MSP
product.

Julien Signès received his Engineer Degree from Ecole Polytechnique in Paris in 1992,
then he joined Ecole Nationale des Telecommunications in Paris where he obtained
a Telecommunication Engineer degree. He has been working at CCETT (Joint research
center for broadcast and telecommunications) in Rennes as a member of the Corps
Interministeriel des Telecommunications since 1994. His research areas cover Image and
Video coding, as well as object oriented design for image processing and multimedia
applications, in the context of European projects and the ISO/MPEG-4 standardization
group.









